Skip to content

Refactor RoutineDiscoveryAgent to an LLM-driven agentic loop with tools + add execute_routine tool#121

Merged
rayruizhiliao merged 33 commits intomainfrom
refactor_discovery_agent
Feb 3, 2026
Merged

Refactor RoutineDiscoveryAgent to an LLM-driven agentic loop with tools + add execute_routine tool#121
rayruizhiliao merged 33 commits intomainfrom
refactor_discovery_agent

Conversation

@rayruizhiliao
Copy link
Contributor

@rayruizhiliao rayruizhiliao commented Jan 30, 2026

This PR

  • refactors RoutineDiscoveryAgent from a hardcoded sequential pipeline to an LLM-driven agentic loop with tools
  • adds a execute_routine tool as part of the agent's loop for catching errors before finalizing the routine

@rayruizhiliao rayruizhiliao changed the title feat: agentic loop for routine discovery Refactored RoutineDiscoveryAgent from a hardcoded sequential pipeline to an LLM-driven agentic loop with tools Jan 30, 2026
@rayruizhiliao rayruizhiliao changed the base branch from network-spy to main January 30, 2026 22:58
@dimavrem22
Copy link
Contributor

dimavrem22 commented Jan 30, 2026

Few thoughts:

I would probably yank these for now then add back in when agent works: record_identified_transaction, record_extracted_variables, record_resolved_variable

--

Edited at 6:24p, 1/30

Dima: Potentially remove queue management tools and manually write them into the agent run flow? Alternatively be very specific about the queue conditions when the agent can exit the loop

@dimavrem22
Copy link
Contributor

could likely consolidate these construct_routine, finalize_routine since agent has access to docs and will get access to execution

@rayruizhiliao
Copy link
Contributor Author

could likely consolidate these construct_routine, finalize_routine since agent has access to docs and will get access to execution

consolidated in commit cc148fd

@rayruizhiliao
Copy link
Contributor Author

Few thoughts:

I would probably yank these for now then add back in when agent works: record_identified_transaction, record_extracted_variables, record_resolved_variable

--

Edited at 6:24p, 1/30

Dima: Potentially remove queue management tools and manually write them into the agent run flow? Alternatively be very specific about the queue conditions when the agent can exit the loop

I decided to keep phase, because validating routine format and executing routine could both impact phase transition. The agentic loop exits only when phase is COMPLETE OR max iterations are reached. Queue status influences the LLM's behavior which eventually leads to phase transitions, but the loop itself only checks the phase, not the queue directly:

            # If no tool calls and not complete, prompt the agent to continue
            elif self._state.phase != DiscoveryPhase.COMPLETE:
                status = self._state.get_queue_status()
                prompt = (
                    f"[ACTION REQUIRED] Current phase: {self._state.phase.value}. "
                    f"Queue: {status['pending_count']} pending, {status['processed_count']} processed. "
                )
                if self._state.phase == DiscoveryPhase.IDENTIFY_TRANSACTION:
                    prompt += "Use list_transactions and get_transaction to find the relevant transaction, then record_identified_transaction."
                elif self._state.phase == DiscoveryPhase.PROCESS_QUEUE:
                    if status['current']:
                        prompt += f"Currently processing: {status['current']}. Extract and resolve variables, then mark_transaction_complete."
                    elif status['pending_count'] > 0:
                        prompt += "Get the next transaction from the queue."
                    else:
                        prompt += "Queue is empty. Call construct_routine to build the routine."
                elif self._state.phase == DiscoveryPhase.CONSTRUCT_ROUTINE:
                    prompt += "Build the routine using construct_routine."
                elif self._state.phase == DiscoveryPhase.VALIDATE_ROUTINE:
                    prompt += "Validate the routine using execute_routine with test parameters."

                self._add_to_message_history("system", prompt)

@rayruizhiliao rayruizhiliao changed the title Refactored RoutineDiscoveryAgent from a hardcoded sequential pipeline to an LLM-driven agentic loop with tools Refactored RoutineDiscoveryAgent to an LLM-driven agentic loop with tools + add "execute_routine" tool Feb 1, 2026
@rayruizhiliao rayruizhiliao changed the title Refactored RoutineDiscoveryAgent to an LLM-driven agentic loop with tools + add "execute_routine" tool Refactored RoutineDiscoveryAgent to an LLM-driven agentic loop with tools + add execute_routine tool Feb 1, 2026
@rayruizhiliao rayruizhiliao marked this pull request as ready for review February 1, 2026 21:09
@claude
Copy link

claude bot commented Feb 1, 2026

Code Review

Found 2 high-signal issues that need attention:

Issue 1: Set ordering bug in state.py

File: bluebox/data_models/routine_discovery/state.py
Location: processed_transactions field declaration and get_ordered_transactions() method

Problem: The processed_transactions field is declared as set[str], but get_ordered_transactions() relies on this collection preserving insertion order to correctly sequence operations with their dependencies. Python sets do not guarantee insertion order—this will cause non-deterministic routine construction where operations may execute in arbitrary order, breaking execution when a dependent fetch runs before the data it needs.

Fix: Change processed_transactions: set[str] to dict[str, None] (which preserves insertion order as of Python 3.7+), or maintain a separate processed_transaction_ids: list[str] to track processing order.

See: https://github.com/VectorlyApp/bluebox-sdk/blob/0b481a693bcb6770b4383fe846e1cd5dcd16fbe7/bluebox/data_models/routine_discovery/state.py#L27-L30


Issue 2: Pydantic PrivateAttr required for _state

File: bluebox/agents/routine_discovery_agent.py
Line: 98

Problem: The _state attribute is declared as _state: RoutineDiscoveryState | None = None on a Pydantic v2 BaseModel. In Pydantic v2, underscore-prefixed attributes require PrivateAttr() to work correctly as private instance attributes. Without it, this becomes a class variable shared across all instances rather than a per-instance attribute managed by Pydantic's model lifecycle.

Fix:

from pydantic import PrivateAttr

_state: RoutineDiscoveryState | None = PrivateAttr(default=None)

See: https://github.com/VectorlyApp/bluebox-sdk/blob/0b481a693bcb6770b4383fe846e1cd5dcd16fbe7/bluebox/agents/routine_discovery_agent.py#L96-L99

Reference: Pydantic v2 Private Attributes Documentation


Checked for bugs and CLAUDE.md compliance.

@dimavrem22
Copy link
Contributor

Some incorrect behaviors:

  • when agent resolves some value from both session storage and a previous network request, agent should be adding that network request to the queue and prioritizing getting that value through network (e.g. spirit airlines one way flight search)
  • i notice after agent executes a routine, even if the execution was not successful, the agent is finishes discovery (should be feeding the output back into llm for it to review and decide if it is done working or not)

@rayruizhiliao
Copy link
Contributor Author

  • when agent resolves some value from both session storage and a previous network request, agent should be adding that network request to the queue and prioritizing getting that value through network (e.g. spirit airlines one way flight search)

Updated in commit 0086491

Now we tell the LLM to use source_type='transaction' as the primary source when both storage and transaction sources are found. The guidance in both the tool description and system prompt now correctly instructs this behavior.

@claude
Copy link

claude bot commented Feb 1, 2026

Code Review - High Signal Issues Found

I've reviewed PR #121 and found 4 high-severity issues that need to be addressed:


1. 🐛 Non-deterministic transaction ordering in state.py (Line 173)

File: bluebox/data_models/routine_discovery/state.py
Reference: https://github.com/VectorlyApp/bluebox-sdk/blob/f1a30901fa3270b71e0a841b1dd1b2f6cb34ab66/bluebox/data_models/routine_discovery/state.py#L172-L179

Issue: The processed_transactions field is declared as set[str] (line 49), but get_ordered_transactions() does list(self.processed_transactions) followed by .reverse(). Since Python sets have no guaranteed iteration order, this produces arbitrary ordering instead of the expected BFS processing order.

This means dependency transactions can execute after the transactions that depend on them, causing failures when a transaction expects a value (e.g., a token) from a prior response.

Fix: Change processed_transactions from set[str] to list[str] to preserve insertion order.


2. 🐛 Pydantic private attribute not properly declared (Line 81)

File: bluebox/agents/routine_discovery_agent.py
Reference: https://github.com/VectorlyApp/bluebox-sdk/blob/f1a30901fa3270b71e0a841b1dd1b2f6cb34ab66/bluebox/agents/routine_discovery_agent.py#L80-L82

Issue: _state is declared as _state: RoutineDiscoveryState | None = None on a Pydantic BaseModel. In Pydantic v2, underscore-prefixed attributes require PrivateAttr() to be recognized. Without it, self._state = RoutineDiscoveryState() (line 819) will raise AttributeError.

Fix: Change the declaration to:

from pydantic import PrivateAttr

_state: RoutineDiscoveryState | None = PrivateAttr(default=None)

3. 🐛 Message history corruption during tool execution (Lines 855, 884)

File: bluebox/agents/routine_discovery_agent.py
Reference: https://github.com/VectorlyApp/bluebox-sdk/blob/f1a30901fa3270b71e0a841b1dd1b2f6cb34ab66/bluebox/agents/routine_discovery_agent.py#L854-L886

Issue: _tool_construct_routine adds user/assistant messages to self.message_history during tool execution (lines 855, 884). This happens before the outer agent loop adds the tool_result (line 1306), creating a malformed conversation sequence: assistant(tool_call) -> user -> assistant -> tool_result.

LLM APIs require assistant(tool_call) -> tool_result immediately. This corrupted history will cause errors or unpredictable behavior on subsequent iterations.

Fix: The productionization LLM call should use a separate message list, not inject into the main message_history.


4. 📋 Missing tests for new features (CLAUDE.md violation)

CLAUDE.md Quote (Line 198): "Add tests for new features"

Issue: PR 121 adds 1,355 lines of new code including:

  • 2 entirely new files (state.py with 191 lines, routine_discovery_tools.py with 400 lines)
  • Complete refactor of RoutineDiscoveryAgent (+732/-608 lines)

No test files were added or modified in this PR. No existing tests cover the new state management, tool definitions, or refactored agentic loop workflow.

Fix: Add tests for:

  • RoutineDiscoveryState class and its methods
  • Tool definitions and the agentic loop workflow
  • Integration tests for the end-to-end routine discovery process

Additional Minor Issue

Cosmetic: Line 83 has a malformed comment with stray text: # === Prompts ===record_identified_transaction, record_extracted_variables, record_resolved_variable should be just # === Prompts ===


Summary: 3 high-severity bugs that will cause runtime failures + 1 CLAUDE.md compliance issue requiring test coverage.

@rayruizhiliao
Copy link
Contributor Author

CLAUDE.md Violations

All violations are related to missing type hints. Per CLAUDE.md:41:

IMPORTANT: Every function and method MUST have type hints

The codebase consistently uses type hints on all test methods and fixtures (see test_js_utils.py and conftest.py).

Files affected:

  1. tests/unit/agents/test_routine_discovery_agent.py

    • Lines 31, 46, 56, 62: Fixtures missing return type hints (-> MagicMock: or -> RoutineDiscoveryAgent:)
    • All 33 test methods missing -> None return type hints
  2. tests/unit/data_models/routine_discovery/test_state.py

    • All test methods missing -> None return type hints
  3. tests/unit/llms/tools/test_routine_discovery_tools.py

    • All test methods missing -> None return type hints

Fixed in commit 8779564

@rayruizhiliao rayruizhiliao changed the title Refactored RoutineDiscoveryAgent to an LLM-driven agentic loop with tools + add execute_routine tool Refactor RoutineDiscoveryAgent to an LLM-driven agentic loop with tools + add execute_routine tool Feb 2, 2026
@dimavrem22
Copy link
Contributor

Screenshot 2026-02-02 at 8 35 45 PM

idk why but the agent tends to give empty strings for test params

@dimavrem22
Copy link
Contributor

544077140-e342dcce-ceac-4661-91fb-4bb3df91686b

the actual routine that was generated was good! when I fixed params, it worked well. but the agent itself still executed routine that failed (note: last fetch result is 400 status code). and the pipeline moved on without correcting it. I get the sense the agent isnt able to set the test prams for some reason

@rayruizhiliao
Copy link
Contributor Author

rayruizhiliao commented Feb 3, 2026

Screenshot 2026-02-02 at 8 35 45 PM idk why but the agent tends to give empty strings for test params

There's a naming mismatch between extracted raw variables and routine parameters. In construct_routine, it defines parameters as "user-facing parameters for the routine", which I want to preserve, even though the raw API request fields may differ from the parameter names generated by the LLM.

To resolve this, I added a mapping in construct_routine between extracted raw variable names and routine parameter names (commit 678f350). As a result, quickstart.py now produces reasonable test parameters.

Screenshot 2026-02-02 at 10 30 20 PM

@claude
Copy link

claude bot commented Feb 3, 2026

Code Review

Found 2 CLAUDE.md compliance issues:

1. Missing return type annotation on __init__ method

File: bluebox/sdk/discovery.py:62-71

The __init__ method is missing a -> None return type annotation.

CLAUDE.md requirement: Every function and method MUST have type hints

Fix: Add -> None: to the method signature:

def __init__(
    self,
    client: BlueBoxClient,
    task: str,
    cdp_captures_dir: str,
    output_dir: str = "./routine_discovery_output",
    llm_model: str = "gpt-5.1",
    message_callback: Callable[[RoutineDiscoveryMessage], None] | None = None,
    remote_debugging_address: str | None = None,
) -> None:

2. Improper import grouping

File: bluebox/sdk/discovery.py:13-23

Import grouping violates CLAUDE.md's requirement to "Group imports: stdlib, third-party, local (with blank lines between groups)".

CLAUDE.md requirement: Group imports: stdlib, third-party, local (with blank lines between groups)

Issues:

  • Missing blank line between stdlib imports (line 16) and third-party imports (line 17)
  • Comment and constant definition (lines 20-21) appear between third-party and local imports

Fix: Reorganize imports:

import json
import os
from pathlib import Path
from typing import Callable

from openai import OpenAI
from pydantic import BaseModel

from ..agents.routine_discovery_agent import RoutineDiscoveryAgent
from ..llms.infra.data_store import LocalDiscoveryDataStore
# ... rest of local imports

# Package root for code_paths (bluebox/sdk/ -> bluebox/)
PACKAGE_ROOT = Path(__file__).resolve().parent.parent

Checked for bugs and CLAUDE.md compliance.

@rayruizhiliao
Copy link
Contributor Author

rayruizhiliao commented Feb 3, 2026

544077140-e342dcce-ceac-4661-91fb-4bb3df91686b the actual routine that was generated was good! when I fixed params, it worked well. but the agent itself still executed routine that failed (note: last fetch result is 400 status code). and the pipeline moved on without correcting it. I get the sense the agent isnt able to set the test prams for some reason

The LLM had no visibility into what test parameters to use. The observed values collected during discovery were only made available after the agent loop completed via get_test_parameters(), but validation happened during the loop. To resolve this, I added _get_test_parameters_for_validation() helper to collect observed values from transaction_data stored during discovery and give them to execute_routine for validation.

[2026-02-02 23:15:01] INFO:bluebox.sdk.discovery:🤔 Executing routine with parameters: ['origin', 'destination', 'departure_date', 'return_date', 'adult_count', 'currency']

@dimavrem22
Copy link
Contributor

Screenshot 2026-02-03 at 12 44 04 PM

These are the 3 last messages:

  • system tells agent to validate the routine
  • agent uses the execute routine tool
  • the tools results come back

But i dont see the part where the agent decides if this routine execution was successful or not?

Looks like we check the output itself for exec_result.ok and failed_placeholders. but it is possible that the routine itself structurally correct, but returns nothing useful in the body. We can have the agent looks at a preview of the output and make sure all is well before truly exiting.

This might be an overkill/too much for this PR tho, lmk what you think!

@claude
Copy link

claude bot commented Feb 3, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@rayruizhiliao
Copy link
Contributor Author

Screenshot 2026-02-03 at 12 44 04 PM These are the 3 last messages:
  • system tells agent to validate the routine
  • agent uses the execute routine tool
  • the tools results come back

But i dont see the part where the agent decides if this routine execution was successful or not?

Looks like we check the output itself for exec_result.ok and failed_placeholders. but it is possible that the routine itself structurally correct, but returns nothing useful in the body. We can have the agent looks at a preview of the output and make sure all is well before truly exiting.

This might be an overkill/too much for this PR tho, lmk what you think!

I think having another LLM call to "sanity check execution result content" could be potentially useful, but it needs the context of the user intent. If we don't design it carefully, it might not yield any benefits. How about we tackle it after this overhaul and tracking it here #130?

@dimavrem22
Copy link
Contributor

Screenshot 2026-02-03 at 3 36 34 PM

these messages are sent via emit_message_callback (used to display progress on the console). We might want to make sure these are concise and frequent. (e.g. we prob dont want to emit transaction ids since they are long and gross)
we might want to emit some data about cookie/token resolution, something like that

@rayruizhiliao
Copy link
Contributor Author

rayruizhiliao commented Feb 3, 2026

Screenshot 2026-02-03 at 3 36 34 PM these messages are sent via emit_message_callback (used to display progress on the console). We might want to make sure these are concise and frequent. (e.g. we prob dont want to emit transaction ids since they are long and gross) we might want to emit some data about cookie/token resolution, something like that

Updated messages below, in commit 2b3b1e1

Phase | Message | Type
-- | -- | --
Start | "Discovery initiated" | INITIATED
Start | "Scanning network traffic" | PROGRESS_THINKING
Identify | "Identified target transaction" | PROGRESS_RESULT
Process | "Analyzing request parameters" | PROGRESS_THINKING
Process | "Resolving dynamic values" | PROGRESS_THINKING
Process | "Processing dependencies" | PROGRESS_THINKING
Process | "All transactions processed, ready to construct routine" | PROGRESS_RESULT
Construct | "Building routine" | PROGRESS_RESULT
Construct | "Productionizing routine" | PROGRESS_THINKING
Validate | "Validating routine" | PROGRESS_THINKING
Validate | "Routine validation failed, retrying" | PROGRESS_RESULT
Validate | "Routine validated successfully" | PROGRESS_RESULT
Complete | "Routine generated successfully" | FINISHED
Error | {error message} | ERROR

Copy link
Contributor

@dimavrem22 dimavrem22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

@rayruizhiliao rayruizhiliao merged commit 94e0457 into main Feb 3, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants