
perf: reduce completion eval overhead and fix follow-up quality#179

Merged
rockfordlhotka merged 5 commits into main from perf/reduce-completion-eval-overhead
Mar 20, 2026
Conversation

@rockfordlhotka
Member

Summary

  • Skip completion evaluation in subagents — the primary agent's evaluator catches incomplete results when synthesising subagent output, so the subagent's own eval was redundant (saved 10-30s per subagent task)
  • Reduce the completion re-prompt cap from 2 to 1 — the second re-prompt rarely improved results and added 15-30s of latency
  • Discard follow-up passes that made no tool calls — prevents "split personality" responses where the follow-up narrates/refuses instead of acting, then gets concatenated to a clean response
  • Tighten follow-up evaluator prompt — steer toward skill refinement, away from suggesting server-side logic changes or redundant cross-system verification

Test plan

  • All 560+ unit tests pass
  • Deployed to cluster — verify faster turn times in logs (rockbot.agent.turn.duration)
  • Verify follow-up discards appear in logs (Follow-up pass made no tool calls; discarding)
  • Verify no split-personality responses in multi-turn conversations
  • Verify subagent tasks still produce useful results without their own completion eval
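The log checks above can be scripted. A minimal sketch, assuming a plain-text log file — the path and line format here are illustrative, not the deployment's real log layout; only the quoted discard message comes from this PR:

```shell
# Hypothetical log file; path and surrounding format are assumptions.
log=/tmp/rockbot-sample.log
printf '%s\n' \
  'INFO rockbot.agent Follow-up pass made no tool calls; discarding' \
  'INFO rockbot.agent.turn.duration 4.2s' > "$log"

# Count how many follow-up passes were discarded
grep -c 'Follow-up pass made no tool calls; discarding' "$log"
```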

🤖 Generated with Claude Code

rockfordlhotka and others added 5 commits March 20, 2026 00:22
Subagent tool loops no longer run the completion evaluator — the primary
agent's evaluator catches incomplete results when it synthesises the
subagent output, eliminating 2 redundant LLM round-trips per subagent
task (10-30s savings observed in production logs).

Default MaxCompletionReprompts reduced from 2 to 1. The second re-prompt
rarely improved results and added 15-30s of latency. Model-specific
overrides still work via ModelBehavior.MaxCompletionRepromptsOverride.
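The two changes above can be sketched together. This is an illustrative outline only — `maybe_reprompt`, `MAX_COMPLETION_REPROMPTS`, and the callback signatures are hypothetical names, not the project's real API:

```python
# Hypothetical sketch of the completion-eval path described above.
MAX_COMPLETION_REPROMPTS = 1  # was 2; the second re-prompt rarely helped

def maybe_reprompt(response: str, is_subagent: bool, evaluate, reprompt) -> str:
    """Run the completion evaluator, re-prompting at most once."""
    if is_subagent:
        # Subagents skip their own eval; the primary agent's evaluator
        # catches incomplete results when it synthesises the output.
        return response
    for _ in range(MAX_COMPLETION_REPROMPTS):
        if evaluate(response):  # True => response judged complete
            break
        response = reprompt(response)
    return response
```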

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Follow-up responses that contain only narration, refusals, or re-statements
of the original answer are now discarded before concatenation. This catches
the "split personality" pattern where the follow-up evaluator finds an
opportunity but the LLM refuses or scope-polices instead of acting —
producing contradictory content appended to an otherwise clean response.

The check counts FunctionCallContent (native path) and [Tool result for ...]
messages (text-based path) added during the follow-up loop. Zero tool calls
means the follow-up added no new information and is dropped.
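The check described above can be sketched as follows. The message shapes are assumptions modeled on the description (`FunctionCallContent` on the native path, `[Tool result for ...]` text on the text-based path); `merge_followup` is an illustrative name:

```python
# Illustrative sketch of the zero-tool-call discard check.
def followup_made_tool_calls(new_messages: list) -> bool:
    """Return True if the follow-up loop added at least one tool call."""
    count = 0
    for msg in new_messages:
        if msg.get("type") == "FunctionCallContent":  # native path
            count += 1
        elif msg.get("text", "").startswith("[Tool result for"):  # text path
            count += 1
    return count > 0

def merge_followup(response: str, followup: str, new_messages: list) -> str:
    # Zero tool calls => the follow-up only narrated, refused, or
    # re-stated the answer; drop it instead of concatenating.
    if not followup_made_tool_calls(new_messages):
        return response
    return response + "\n\n" + followup
```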

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add skill creation/refinement to the good follow-ups list so the evaluator
suggests reusable learnings. Add two new bad follow-up patterns that were
causing split-personality responses: implementing server-side logic/rules
(agent can't change service behavior at runtime) and searching unrelated
systems to double-check work already completed via the authoritative source.
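A config-style sketch of the prompt lists described above — the wording of the actual evaluator prompt is not shown in this PR, so these entries are paraphrases, not the real prompt text:

```python
# Illustrative prompt fragments only; paraphrased from the commit message.
GOOD_FOLLOWUPS = [
    "Create or refine a skill capturing a reusable learning from this task",
]
BAD_FOLLOWUPS = [
    "Implement server-side logic or rules "
    "(the agent cannot change service behavior at runtime)",
    "Search unrelated systems to double-check work already "
    "completed via the authoritative source",
]
```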

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The follow-up evaluator now classifies the user's original request as
closed/specific ("what is on my todo list?") vs open/exploratory ("find
emails from Richard and see if I have outstanding requests") before
considering follow-ups. Closed requests almost never warrant follow-ups —
the user asked for X, got X, done. Exploratory requests may benefit from
connecting dots across systems.

This addresses the root cause of unnecessary follow-up passes on simple
queries that were adding 18-50s of latency and sometimes producing
contradictory "split personality" responses.
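The gating described above can be sketched as follows. `classify_request` stands in for the LLM-based classifier with a toy keyword heuristic — the real classification happens in the evaluator prompt, and these function names are hypothetical:

```python
# Toy stand-in for the LLM classifier: short, single-question asks
# are treated as "closed"; connective/exploratory phrasing as "open".
def classify_request(user_request: str) -> str:
    open_markers = (" and ", "see if", "find", "explore")
    if any(m in user_request.lower() for m in open_markers):
        return "open"
    return "closed"

def should_consider_followup(user_request: str) -> bool:
    # Closed requests almost never warrant follow-ups: the user asked
    # for X, got X, done. Only exploratory requests proceed to the
    # follow-up evaluator.
    return classify_request(user_request) == "open"
```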

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the agent loop called spawn_subagent or invoke_agent, the completion
evaluator now skips rather than re-prompting. The SubagentResultHandler
will deliver the result — re-prompting races with it and produces
duplicate answers (the user sees both the subagent result and the
re-prompted primary response).

Also updates directives to allow direct handling of simple closed
questions that need only 1-2 tool calls. The subagent overhead is
counterproductive for "when does my class end?" style queries.
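The skip condition above reduces to a simple membership check. A minimal sketch — `SUBAGENT_TOOLS` and the list-of-names input shape are assumptions based on the tool names quoted in the commit message:

```python
# Hypothetical sketch: skip the completion evaluator when a subagent
# tool was called this turn, since the SubagentResultHandler will
# deliver the result and re-prompting would race it.
SUBAGENT_TOOLS = {"spawn_subagent", "invoke_agent"}

def should_skip_completion_eval(tool_calls_this_turn: list) -> bool:
    return any(name in SUBAGENT_TOOLS for name in tool_calls_this_turn)
```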

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rockfordlhotka rockfordlhotka merged commit 897efd4 into main Mar 20, 2026
2 checks passed
@rockfordlhotka rockfordlhotka deleted the perf/reduce-completion-eval-overhead branch March 20, 2026 15:33