Refactor `RoutineDiscoveryAgent` to an LLM-driven agentic loop with tools + add `execute_routine` tool by rayruizhiliao · Pull Request #121 · VectorlyApp/bluebox

rayruizhiliao · 2026-01-30T20:26:44Z

This PR

refactors RoutineDiscoveryAgent from a hardcoded sequential pipeline to an LLM-driven agentic loop with tools
adds a execute_routine tool as part of the agent's loop for catching errors before finalizing the routine

…o refactor_discovery_agent

dimavrem22 · 2026-01-30T23:03:44Z

Few thoughts:

I would probably yank these for now then add back in when agent works: record_identified_transaction, record_extracted_variables, record_resolved_variable

--

Edited at 6:24p, 1/30

Dima: Potentially remove queue management tools and manually write them into the agent run flow? Alternatively be very specific about the queue conditions when the agent can exit the loop

dimavrem22 · 2026-01-30T23:05:34Z

could likely consolidate these construct_routine, finalize_routine since agent has access to docs and will get access to execution

rayruizhiliao · 2026-02-01T19:43:35Z

could likely consolidate these construct_routine, finalize_routine since agent has access to docs and will get access to execution

consolidated in commit cc148fd

rayruizhiliao · 2026-02-01T21:04:24Z

Few thoughts:

I would probably yank these for now then add back in when agent works: record_identified_transaction, record_extracted_variables, record_resolved_variable

--

Edited at 6:24p, 1/30

Dima: Potentially remove queue management tools and manually write them into the agent run flow? Alternatively be very specific about the queue conditions when the agent can exit the loop

I decided to keep phase, because validating routine format and executing routine could both impact phase transition. The agentic loop exits only when phase is COMPLETE OR max iterations are reached. Queue status influences the LLM's behavior which eventually leads to phase transitions, but the loop itself only checks the phase, not the queue directly:

            # If no tool calls and not complete, prompt the agent to continue
            elif self._state.phase != DiscoveryPhase.COMPLETE:
                status = self._state.get_queue_status()
                prompt = (
                    f"[ACTION REQUIRED] Current phase: {self._state.phase.value}. "
                    f"Queue: {status['pending_count']} pending, {status['processed_count']} processed. "
                )
                if self._state.phase == DiscoveryPhase.IDENTIFY_TRANSACTION:
                    prompt += "Use list_transactions and get_transaction to find the relevant transaction, then record_identified_transaction."
                elif self._state.phase == DiscoveryPhase.PROCESS_QUEUE:
                    if status['current']:
                        prompt += f"Currently processing: {status['current']}. Extract and resolve variables, then mark_transaction_complete."
                    elif status['pending_count'] > 0:
                        prompt += "Get the next transaction from the queue."
                    else:
                        prompt += "Queue is empty. Call construct_routine to build the routine."
                elif self._state.phase == DiscoveryPhase.CONSTRUCT_ROUTINE:
                    prompt += "Build the routine using construct_routine."
                elif self._state.phase == DiscoveryPhase.VALIDATE_ROUTINE:
                    prompt += "Validate the routine using execute_routine with test parameters."

                self._add_to_message_history("system", prompt)

claude · 2026-02-01T21:19:56Z

Code Review

Found 2 high-signal issues that need attention:

Issue 1: Set ordering bug in `state.py`

File: bluebox/data_models/routine_discovery/state.py
Location: processed_transactions field declaration and get_ordered_transactions() method

Problem: The processed_transactions field is declared as set[str], but get_ordered_transactions() relies on this collection preserving insertion order to correctly sequence operations with their dependencies. Python sets do not guarantee insertion order—this will cause non-deterministic routine construction where operations may execute in arbitrary order, breaking execution when a dependent fetch runs before the data it needs.

Fix: Change processed_transactions: set[str] to dict[str, None] (which preserves insertion order as of Python 3.7+), or maintain a separate processed_transaction_ids: list[str] to track processing order.

See: https://github.com/VectorlyApp/bluebox-sdk/blob/0b481a693bcb6770b4383fe846e1cd5dcd16fbe7/bluebox/data_models/routine_discovery/state.py#L27-L30

Issue 2: Pydantic PrivateAttr required for `_state`

File: bluebox/agents/routine_discovery_agent.py
Line: 98

Problem: The _state attribute is declared as _state: RoutineDiscoveryState | None = None on a Pydantic v2 BaseModel. In Pydantic v2, underscore-prefixed attributes require PrivateAttr() to work correctly as private instance attributes. Without it, this becomes a class variable shared across all instances rather than a per-instance attribute managed by Pydantic's model lifecycle.

Fix:

from pydantic import PrivateAttr

_state: RoutineDiscoveryState | None = PrivateAttr(default=None)

See: https://github.com/VectorlyApp/bluebox-sdk/blob/0b481a693bcb6770b4383fe846e1cd5dcd16fbe7/bluebox/agents/routine_discovery_agent.py#L96-L99

Reference: Pydantic v2 Private Attributes Documentation

Checked for bugs and CLAUDE.md compliance.

dimavrem22 · 2026-02-01T21:44:54Z

Some incorrect behaviors:

when agent resolves some value from both session storage and a previous network request, agent should be adding that network request to the queue and prioritizing getting that value through network (e.g. spirit airlines one way flight search)
i notice after agent executes a routine, even if the execution was not successful, the agent is finishes discovery (should be feeding the output back into llm for it to review and decide if it is done working or not)

rayruizhiliao · 2026-02-01T21:56:39Z

when agent resolves some value from both session storage and a previous network request, agent should be adding that network request to the queue and prioritizing getting that value through network (e.g. spirit airlines one way flight search)

Updated in commit 0086491

Now we tell the LLM to use source_type='transaction' as the primary source when both storage and transaction sources are found. The guidance in both the tool description and system prompt now correctly instructs this behavior.

claude · 2026-02-01T22:02:32Z

Code Review - High Signal Issues Found

I've reviewed PR #121 and found 4 high-severity issues that need to be addressed:

1. 🐛 Non-deterministic transaction ordering in `state.py` (Line 173)

File: bluebox/data_models/routine_discovery/state.py
Reference: https://github.com/VectorlyApp/bluebox-sdk/blob/f1a30901fa3270b71e0a841b1dd1b2f6cb34ab66/bluebox/data_models/routine_discovery/state.py#L172-L179

Issue: The processed_transactions field is declared as set[str] (line 49), but get_ordered_transactions() does list(self.processed_transactions) followed by .reverse(). Since Python sets have no guaranteed iteration order, this produces arbitrary ordering instead of the expected BFS processing order.

This means dependency transactions can execute after the transactions that depend on them, causing failures when a transaction expects a value (e.g., a token) from a prior response.

Fix: Change processed_transactions from set[str] to list[str] to preserve insertion order.

2. 🐛 Pydantic private attribute not properly declared (Line 81)

File: bluebox/agents/routine_discovery_agent.py
Reference: https://github.com/VectorlyApp/bluebox-sdk/blob/f1a30901fa3270b71e0a841b1dd1b2f6cb34ab66/bluebox/agents/routine_discovery_agent.py#L80-L82

Issue: _state is declared as _state: RoutineDiscoveryState | None = None on a Pydantic BaseModel. In Pydantic v2, underscore-prefixed attributes require PrivateAttr() to be recognized. Without it, self._state = RoutineDiscoveryState() (line 819) will raise AttributeError.

Fix: Change the declaration to:

from pydantic import PrivateAttr

_state: RoutineDiscoveryState | None = PrivateAttr(default=None)

3. 🐛 Message history corruption during tool execution (Lines 855, 884)

File: bluebox/agents/routine_discovery_agent.py
Reference: https://github.com/VectorlyApp/bluebox-sdk/blob/f1a30901fa3270b71e0a841b1dd1b2f6cb34ab66/bluebox/agents/routine_discovery_agent.py#L854-L886

Issue: _tool_construct_routine adds user/assistant messages to self.message_history during tool execution (lines 855, 884). This happens before the outer agent loop adds the tool_result (line 1306), creating a malformed conversation sequence: assistant(tool_call) -> user -> assistant -> tool_result.

LLM APIs require assistant(tool_call) -> tool_result immediately. This corrupted history will cause errors or unpredictable behavior on subsequent iterations.

Fix: The productionization LLM call should use a separate message list, not inject into the main message_history.

4. 📋 Missing tests for new features (CLAUDE.md violation)

CLAUDE.md Quote (Line 198): "Add tests for new features"

Issue: PR 121 adds 1,355 lines of new code including:

2 entirely new files (state.py with 191 lines, routine_discovery_tools.py with 400 lines)
Complete refactor of RoutineDiscoveryAgent (+732/-608 lines)

No test files were added or modified in this PR. No existing tests cover the new state management, tool definitions, or refactored agentic loop workflow.

Fix: Add tests for:

RoutineDiscoveryState class and its methods
Tool definitions and the agentic loop workflow
Integration tests for the end-to-end routine discovery process

Additional Minor Issue

Cosmetic: Line 83 has a malformed comment with stray text: # === Prompts ===record_identified_transaction, record_extracted_variables, record_resolved_variable should be just # === Prompts ===

Summary: 3 high-severity bugs that will cause runtime failures + 1 CLAUDE.md compliance issue requiring test coverage.

rayruizhiliao · 2026-02-02T00:23:37Z

CLAUDE.md Violations

All violations are related to missing type hints. Per CLAUDE.md:41:

IMPORTANT: Every function and method MUST have type hints

The codebase consistently uses type hints on all test methods and fixtures (see test_js_utils.py and conftest.py).

Files affected:

tests/unit/agents/test_routine_discovery_agent.py

Lines 31, 46, 56, 62: Fixtures missing return type hints (-> MagicMock: or -> RoutineDiscoveryAgent:)

All 33 test methods missing -> None return type hints

tests/unit/data_models/routine_discovery/test_state.py

All test methods missing -> None return type hints

tests/unit/llms/tools/test_routine_discovery_tools.py

All test methods missing -> None return type hints

Fixed in commit 8779564

…/bluebox-sdk into refactor_discovery_agent

…o refactor_discovery_agent

dimavrem22 · 2026-02-03T01:36:22Z

idk why but the agent tends to give empty strings for test params

dimavrem22 · 2026-02-03T01:45:21Z

544077140-e342dcce-ceac-4661-91fb-4bb3df91686b

the actual routine that was generated was good! when I fixed params, it worked well. but the agent itself still executed routine that failed (note: last fetch result is 400 status code). and the pipeline moved on without correcting it. I get the sense the agent isnt able to set the test prams for some reason

rayruizhiliao · 2026-02-03T03:36:06Z

idk why but the agent tends to give empty strings for test params

There's a naming mismatch between extracted raw variables and routine parameters. In construct_routine, it defines parameters as "user-facing parameters for the routine", which I want to preserve, even though the raw API request fields may differ from the parameter names generated by the LLM.

To resolve this, I added a mapping in construct_routine between extracted raw variable names and routine parameter names (commit 678f350). As a result, quickstart.py now produces reasonable test parameters.

claude · 2026-02-03T03:42:42Z

Code Review

Found 2 CLAUDE.md compliance issues:

1. Missing return type annotation on `init` method

File: bluebox/sdk/discovery.py:62-71

The __init__ method is missing a -> None return type annotation.

CLAUDE.md requirement: Every function and method MUST have type hints

Fix: Add -> None: to the method signature:

def __init__(
    self,
    client: BlueBoxClient,
    task: str,
    cdp_captures_dir: str,
    output_dir: str = "./routine_discovery_output",
    llm_model: str = "gpt-5.1",
    message_callback: Callable[[RoutineDiscoveryMessage], None] | None = None,
    remote_debugging_address: str | None = None,
) -> None:

2. Improper import grouping

File: bluebox/sdk/discovery.py:13-23

Import grouping violates CLAUDE.md's requirement to "Group imports: stdlib, third-party, local (with blank lines between groups)".

CLAUDE.md requirement: Group imports: stdlib, third-party, local (with blank lines between groups)

Issues:

Missing blank line between stdlib imports (line 16) and third-party imports (line 17)
Comment and constant definition (lines 20-21) appear between third-party and local imports

Fix: Reorganize imports:

import json
import os
from pathlib import Path
from typing import Callable

from openai import OpenAI
from pydantic import BaseModel

from ..agents.routine_discovery_agent import RoutineDiscoveryAgent
from ..llms.infra.data_store import LocalDiscoveryDataStore
# ... rest of local imports

# Package root for code_paths (bluebox/sdk/ -> bluebox/)
PACKAGE_ROOT = Path(__file__).resolve().parent.parent

Checked for bugs and CLAUDE.md compliance.

rayruizhiliao · 2026-02-03T04:21:01Z

the actual routine that was generated was good! when I fixed params, it worked well. but the agent itself still executed routine that failed (note: last fetch result is 400 status code). and the pipeline moved on without correcting it. I get the sense the agent isnt able to set the test prams for some reason

The LLM had no visibility into what test parameters to use. The observed values collected during discovery were only made available after the agent loop completed via get_test_parameters(), but validation happened during the loop. To resolve this, I added _get_test_parameters_for_validation() helper to collect observed values from transaction_data stored during discovery and give them to execute_routine for validation.

[2026-02-02 23:15:01] INFO:bluebox.sdk.discovery:🤔 Executing routine with parameters: ['origin', 'destination', 'departure_date', 'return_date', 'adult_count', 'currency']

bluebox/data_models/routine/parameter.py

bluebox/data_models/routine_discovery/state.py

bluebox/agents/routine_discovery_agent.py

dimavrem22 · 2026-02-03T17:53:24Z

These are the 3 last messages:

system tells agent to validate the routine
agent uses the execute routine tool
the tools results come back

But i dont see the part where the agent decides if this routine execution was successful or not?

Looks like we check the output itself for exec_result.ok and failed_placeholders. but it is possible that the routine itself structurally correct, but returns nothing useful in the body. We can have the agent looks at a preview of the output and make sure all is well before truly exiting.

This might be an overkill/too much for this PR tho, lmk what you think!

…o refactor_discovery_agent

claude · 2026-02-03T20:20:46Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

rayruizhiliao · 2026-02-03T20:27:48Z

These are the 3 last messages:

system tells agent to validate the routine

agent uses the execute routine tool

the tools results come back

But i dont see the part where the agent decides if this routine execution was successful or not?

Looks like we check the output itself for exec_result.ok and failed_placeholders. but it is possible that the routine itself structurally correct, but returns nothing useful in the body. We can have the agent looks at a preview of the output and make sure all is well before truly exiting.

This might be an overkill/too much for this PR tho, lmk what you think!

I think having another LLM call to "sanity check execution result content" could be potentially useful, but it needs the context of the user intent. If we don't design it carefully, it might not yield any benefits. How about we tackle it after this overhaul and tracking it here #130?

dimavrem22 · 2026-02-03T20:41:39Z

these messages are sent via emit_message_callback (used to display progress on the console). We might want to make sure these are concise and frequent. (e.g. we prob dont want to emit transaction ids since they are long and gross)
we might want to emit some data about cookie/token resolution, something like that

rayruizhiliao · 2026-02-03T20:47:40Z

these messages are sent via emit_message_callback (used to display progress on the console). We might want to make sure these are concise and frequent. (e.g. we prob dont want to emit transaction ids since they are long and gross) we might want to emit some data about cookie/token resolution, something like that

Updated messages below, in commit 2b3b1e1

Phase | Message | Type
-- | -- | --
Start | "Discovery initiated" | INITIATED
Start | "Scanning network traffic" | PROGRESS_THINKING
Identify | "Identified target transaction" | PROGRESS_RESULT
Process | "Analyzing request parameters" | PROGRESS_THINKING
Process | "Resolving dynamic values" | PROGRESS_THINKING
Process | "Processing dependencies" | PROGRESS_THINKING
Process | "All transactions processed, ready to construct routine" | PROGRESS_RESULT
Construct | "Building routine" | PROGRESS_RESULT
Construct | "Productionizing routine" | PROGRESS_THINKING
Validate | "Validating routine" | PROGRESS_THINKING
Validate | "Routine validation failed, retrying" | PROGRESS_RESULT
Validate | "Routine validated successfully" | PROGRESS_RESULT
Complete | "Routine generated successfully" | FINISHED
Error | {error message} | ERROR

dimavrem22

🔥

feat: agentic loop for routine discovery

859cc48

rayruizhiliao changed the title ~~feat: agentic loop for routine discovery~~ Refactored RoutineDiscoveryAgent from a hardcoded sequential pipeline to an LLM-driven agentic loop with tools Jan 30, 2026

feat: agentic loop for routine discovery

3815283

rayruizhiliao changed the base branch from network-spy to main January 30, 2026 22:58

Merge branch 'main' of https://github.com/VectorlyApp/bluebox-sdk int…

3785543

…o refactor_discovery_agent

rayruizhiliao added 4 commits February 1, 2026 13:13

update routine discovery agent

c1658f2

consolidate construct and finalzie routine

cc148fd

fix: tool call issues

755366e

fix: test parameter tool

851d61c

rayruizhiliao added 4 commits February 1, 2026 14:43

Merge branch 'main' into refactor_discovery_agent

7b81eba

make refactored agent compatible with sdks

dc65bb8

add execute_routine tool

55ae209

passing remote debugging addr

c8b20fc

update quickstart

0b481a6

rayruizhiliao changed the title ~~Refactored RoutineDiscoveryAgent from a hardcoded sequential pipeline to an LLM-driven agentic loop with tools~~ Refactored RoutineDiscoveryAgent to an LLM-driven agentic loop with tools + add "execute_routine" tool Feb 1, 2026

rayruizhiliao changed the title ~~Refactored RoutineDiscoveryAgent to an LLM-driven agentic loop with tools + add "execute_routine" tool~~ Refactored RoutineDiscoveryAgent to an LLM-driven agentic loop with tools + add execute_routine tool Feb 1, 2026

rayruizhiliao marked this pull request as ready for review February 1, 2026 21:09

rayruizhiliao requested review from alex-w-99 and dimavrem22 as code owners February 1, 2026 21:09

rayruizhiliao added 2 commits February 1, 2026 16:52

prioritize transaction sources to storage

1812e49

prioritize transaction sources to storage

0086491

review execute_routine results

f1a3090

rayruizhiliao changed the title ~~Refactored RoutineDiscoveryAgent to an LLM-driven agentic loop with tools + add execute_routine tool~~ Refactor RoutineDiscoveryAgent to an LLM-driven agentic loop with tools + add execute_routine tool Feb 2, 2026

rayruizhiliao added 2 commits February 2, 2026 14:53

Merge branch 'remaining-datastores' of https://github.com/VectorlyApp…

28d2369

…/bluebox-sdk into refactor_discovery_agent

Merge branch 'main' of https://github.com/VectorlyApp/bluebox-sdk int…

efa254d

…o refactor_discovery_agent

mapping source var

678f350

rayruizhiliao added 2 commits February 2, 2026 23:09

feat: _get_test_parameters_for_validation

6be1011

add a fall back

512ee0f

dimavrem22 reviewed Feb 3, 2026

View reviewed changes

bluebox/data_models/routine/parameter.py Show resolved Hide resolved

dimavrem22 reviewed Feb 3, 2026

View reviewed changes

bluebox/data_models/routine_discovery/state.py Outdated Show resolved Hide resolved

dimavrem22 reviewed Feb 3, 2026

View reviewed changes

bluebox/data_models/routine_discovery/state.py Outdated Show resolved Hide resolved

dimavrem22 reviewed Feb 3, 2026

View reviewed changes

bluebox/agents/routine_discovery_agent.py Show resolved Hide resolved

rayruizhiliao added 5 commits February 3, 2026 13:39

Merge branch 'main' of https://github.com/VectorlyApp/bluebox-sdk int…

7ca513b

…o refactor_discovery_agent

replace source var with observed val

9322070

rename get next transaction

cf5b11e

split the responsibilities

308d8e3

add back placeholder examples

ce7802a

rayruizhiliao mentioned this pull request Feb 3, 2026

Sanity check execution result content #130

Open

update emitted messages

2b3b1e1

dimavrem22 approved these changes Feb 3, 2026

View reviewed changes

rayruizhiliao merged commit 94e0457 into main Feb 3, 2026
9 checks passed

Conversation

rayruizhiliao commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dimavrem22 commented Jan 30, 2026 • edited by rayruizhiliao Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dimavrem22 commented Jan 30, 2026

Uh oh!

rayruizhiliao commented Feb 1, 2026

Uh oh!

rayruizhiliao commented Feb 1, 2026

Uh oh!

claude bot commented Feb 1, 2026

Code Review

Issue 1: Set ordering bug in state.py

Issue 2: Pydantic PrivateAttr required for _state

Uh oh!

dimavrem22 commented Feb 1, 2026

Uh oh!

rayruizhiliao commented Feb 1, 2026

Uh oh!

claude bot commented Feb 1, 2026

Code Review - High Signal Issues Found

1. 🐛 Non-deterministic transaction ordering in state.py (Line 173)

2. 🐛 Pydantic private attribute not properly declared (Line 81)

3. 🐛 Message history corruption during tool execution (Lines 855, 884)

4. 📋 Missing tests for new features (CLAUDE.md violation)

Additional Minor Issue

Uh oh!

rayruizhiliao commented Feb 2, 2026

CLAUDE.md Violations

Uh oh!

dimavrem22 commented Feb 3, 2026

Uh oh!

dimavrem22 commented Feb 3, 2026

Uh oh!

rayruizhiliao commented Feb 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude bot commented Feb 3, 2026

Code Review

1. Missing return type annotation on __init__ method

2. Improper import grouping

Uh oh!

rayruizhiliao commented Feb 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dimavrem22 commented Feb 3, 2026

Uh oh!

claude bot commented Feb 3, 2026

Code review

Uh oh!

rayruizhiliao commented Feb 3, 2026

Uh oh!

dimavrem22 commented Feb 3, 2026

Uh oh!

rayruizhiliao commented Feb 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dimavrem22 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rayruizhiliao commented Jan 30, 2026 •

edited

Loading

dimavrem22 commented Jan 30, 2026 •

edited by rayruizhiliao

Loading

Issue 1: Set ordering bug in `state.py`

Issue 2: Pydantic PrivateAttr required for `_state`

1. 🐛 Non-deterministic transaction ordering in `state.py` (Line 173)

rayruizhiliao commented Feb 3, 2026 •

edited

Loading

1. Missing return type annotation on `init` method

rayruizhiliao commented Feb 3, 2026 •

edited

Loading

rayruizhiliao commented Feb 3, 2026 •

edited

Loading