fix(core): emit judge usage telemetry on eval scorers#1168

Merged
omeraplak merged 1 commit into main from fix/eval-scorer-cost-telemetry
Mar 20, 2026

Conversation

@omeraplak omeraplak commented Mar 20, 2026

PR Checklist

Please check if your PR fulfills the following requirements:

Bugs / Features

What is the current behavior?

LLM-based eval scorers can collect judge usage and provider cost information, but that telemetry is not emitted on scorer spans.

As a result, downstream observability and cost aggregation cannot reliably attribute eval scorer token/cost usage separately from the main agent run.

What is the new behavior?

createLLMJudgeScorer now preserves judge model, normalized token usage, and OpenRouter provider cost details in scorer metadata, and eval span creation maps that telemetry onto scorer span attributes.

This makes scorer-side usage visible in observability pipelines and enables downstream cost aggregation to split agent cost from eval scorer cost.
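
As a rough illustration of the downstream split described above, the sketch below sums a `usage.cost` span attribute separately for agent and scorer spans. The `SpanLike` shape, the `kind` discriminator, and the `splitCost` helper are hypothetical and not part of this PR; only the `usage.cost` attribute name comes from the change summary.

```typescript
// Hypothetical, simplified span shape; real exported spans carry more fields.
interface SpanLike {
  // Assumption: the consumer can already tell scorer spans from agent spans.
  kind: "agent" | "scorer";
  attributes: Record<string, unknown>;
}

interface CostSplit {
  agentCost: number;
  scorerCost: number;
}

// Sum `usage.cost` per span kind, skipping spans without a finite numeric cost.
function splitCost(spans: SpanLike[]): CostSplit {
  const split: CostSplit = { agentCost: 0, scorerCost: 0 };
  for (const span of spans) {
    const cost = span.attributes["usage.cost"];
    if (typeof cost !== "number" || !Number.isFinite(cost)) continue;
    if (span.kind === "scorer") {
      split.scorerCost += cost;
    } else {
      split.agentCost += cost;
    }
  }
  return split;
}
```

Spans that carry no cost, or a non-numeric one, are skipped rather than counted as zero-cost errors, which keeps the aggregate usable when only some providers report cost.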

fixes N/A

Notes for reviewers

  • Verified with pnpm --filter @voltagent/core typecheck
  • Verified with pnpm --filter @voltagent/core build
  • The branch intentionally only includes the eval telemetry changes and the new changeset.

Summary by cubic

Emit judge usage and cost telemetry on eval scorer spans in @voltagent/core so observability and cost reports can separate eval scorer usage from the main agent run.

  • Bug Fixes
    • Store judge model, normalized token usage (prompt, completion, total, cached, reasoning), and OpenRouter cost in scorer metadata.
    • Populate scorer span attributes (ai.model.name, usage.*, usage.cost, usage.cost_details.*) from that telemetry.
    • Normalize usage from success and error paths and extract provider cost from providerMetadata.

Written for commit 5598cc8. Summary will update on new commits.
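
The attribute mapping summarized in the bullets above could look roughly like the following. This is a sketch under assumptions, not the code in this PR: `JudgeTelemetrySketch` and `toSpanAttributes` are made-up names, and the individual `usage.*` keys (for example `usage.prompt_tokens`) are guesses. Only `ai.model.name`, `usage.cost`, and the `usage.cost_details.` prefix are taken from the summary.

```typescript
// Hypothetical normalized telemetry shape; field names are illustrative.
interface JudgeTelemetrySketch {
  modelName?: string;
  promptTokens?: number;
  completionTokens?: number;
  totalTokens?: number;
  cachedTokens?: number;
  reasoningTokens?: number;
  cost?: number;
  costDetails?: Record<string, number>;
}

// Build flat span attributes, omitting every value that is undefined.
function toSpanAttributes(t: JudgeTelemetrySketch): Record<string, string | number> {
  const attrs: Record<string, string | number> = {};
  const set = (key: string, value: string | number | undefined): void => {
    if (value !== undefined) attrs[key] = value;
  };

  set("ai.model.name", t.modelName);
  // Assumed key names under the `usage.` prefix.
  set("usage.prompt_tokens", t.promptTokens);
  set("usage.completion_tokens", t.completionTokens);
  set("usage.total_tokens", t.totalTokens);
  set("usage.cached_tokens", t.cachedTokens);
  set("usage.reasoning_tokens", t.reasoningTokens);
  set("usage.cost", t.cost);

  // Each cost detail becomes its own attribute under `usage.cost_details.`.
  for (const [name, value] of Object.entries(t.costDetails ?? {})) {
    set(`usage.cost_details.${name}`, value);
  }
  return attrs;
}
```

Leaving out undefined values avoids emitting empty attributes for providers that do not report cached or reasoning tokens.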

Summary by CodeRabbit

  • New Features
    • Enhanced telemetry for eval scorer spans now captures judge model identification, comprehensive token usage metrics (including cached and reasoning tokens), and provider-reported cost breakdowns. This enables improved observability in backend systems and supports downstream cost aggregation that distinguishes eval scoring costs from agent operation costs.

changeset-bot bot commented Mar 20, 2026

🦋 Changeset detected

Latest commit: 5598cc8

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
| Name | Type |
| --- | --- |
| @voltagent/core | Patch |

coderabbitai bot commented Mar 20, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This change introduces telemetry capture for LLM judge scoring operations in VoltAgent. It extracts judge model information, token usage (including cached and reasoning tokens), and provider cost details from judge scorer execution, then attaches this metadata to observable span attributes for downstream observability and cost aggregation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Changeset Documentation: .changeset/eval-scorer-cost-telemetry.md | Adds changeset entry for @voltagent/core patch release documenting new judge telemetry emission on eval scorer spans (model, token usage, provider costs). |
| Judge Telemetry Extraction: packages/core/src/agent/eval.ts | Introduces JudgeTelemetry interface and extractJudgeTelemetry() function with safe parsing helpers to read judge metadata from combined records, then extends createScorerSpanAttributes to attach extracted model name, token counts, and cost breakdowns to span attributes. |
| Judge Scorer Telemetry Capture: packages/core/src/eval/llm/create-judge-scorer.ts | Enhances createLLMJudgeScorer to capture and normalize judge usage and providerMetadata from LLM calls, extract OpenRouter cost details, and attach captured telemetry to scorer metadata as voltAgent.judge on both success and error paths, including multiple helper functions for model resolution, cost extraction, and metadata normalization. |
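
To make the "safe parsing helpers" idea in the table concrete, here is a minimal sketch of reading judge telemetry from untyped scorer metadata. It assumes the telemetry is nested as `metadata.voltAgent.judge`; the field names inside that record (`model`, `usage.totalTokens`, `cost`) and the `readJudgeSummary` helper are hypothetical and may not match the actual `extractJudgeTelemetry()` implementation.

```typescript
type UnknownRecord = Record<string, unknown>;

// Narrow an unknown value to a plain object (not null, not an array).
function isRecord(value: unknown): value is UnknownRecord {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Accept only finite numbers; anything else is treated as absent.
function readNumber(value: unknown): number | undefined {
  return typeof value === "number" && Number.isFinite(value) ? value : undefined;
}

// Accept only non-empty strings.
function readString(value: unknown): string | undefined {
  return typeof value === "string" && value.length > 0 ? value : undefined;
}

// Hypothetical summary shape, much smaller than a full telemetry interface.
interface JudgeSummary {
  model: string | undefined;
  totalTokens: number | undefined;
  cost: number | undefined;
}

// Walk metadata -> voltAgent -> judge, bailing out at the first malformed level.
function readJudgeSummary(metadata: unknown): JudgeSummary | undefined {
  if (!isRecord(metadata)) return undefined;
  const voltAgent = metadata["voltAgent"];
  if (!isRecord(voltAgent)) return undefined;
  const judge = voltAgent["judge"];
  if (!isRecord(judge)) return undefined;

  const usage = isRecord(judge["usage"]) ? judge["usage"] : undefined;
  return {
    model: readString(judge["model"]),
    totalTokens: readNumber(usage?.["totalTokens"]),
    cost: readNumber(judge["cost"]),
  };
}
```

Because every level is validated before it is read, malformed or missing metadata yields `undefined` fields instead of throwing during span creation.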

Sequence Diagram

sequenceDiagram
    participant Client
    participant Scorer as LLM Judge Scorer
    participant LLM as Judge Model
    participant Span as Span Attributes

    Client->>Scorer: createLLMJudgeScorer.evaluate(payload)
    Scorer->>LLM: generateText(prompt)
    LLM-->>Scorer: text, usage, providerMetadata
    Scorer->>Scorer: extractJudgeTelemetry()<br/>(model, usage, costs)
    Scorer-->>Client: ScorerResult with voltAgent.judge metadata
    Client->>Span: createScorerSpanAttributes(metadata)
    Span->>Span: extractJudgeTelemetry()<br/>from metadata
    Span-->>Span: Attach ai.model.name,<br/>usage.*, usage.cost

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • lzj960515

Poem

🐰 A judge hops in with token tales,
Usage counts on scoring scales,
Cost details caught and costs unfurled,
Judge telemetry takes the world! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: emitting judge usage telemetry on eval scorers, which directly aligns with the changeset and file modifications. |
| Description check | ✅ Passed | The description comprehensively covers the template sections, clearly explains current vs. new behavior, documents the changes made, confirms changesets and typechecking, and includes verification details. |

Deploying voltagent with Cloudflare Pages

Latest commit: 5598cc8
Status: 🚫 Build failed.

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/src/eval/llm/create-judge-scorer.ts">

<violation number="1" location="packages/core/src/eval/llm/create-judge-scorer.ts:238">
P2: OpenRouter judge telemetry parsing is incomplete and can miss provider cost fields when metadata uses snake_case keys.</violation>
</file>

? providerMetadata.openrouter
: undefined;
const usage = isRecord(openRouterMetadata?.usage) ? openRouterMetadata.usage : undefined;
const costDetails = isRecord(usage?.costDetails) ? usage.costDetails : undefined;

@cubic-dev-ai cubic-dev-ai bot Mar 20, 2026

P2: OpenRouter judge telemetry parsing is incomplete and can miss provider cost fields when metadata uses snake_case keys.
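
One possible way to address this, sketched under assumptions: look up each nested record under several candidate keys so that both camelCase and snake_case spellings are accepted. The `pickRecord` and `readOpenRouterCostDetails` helpers are hypothetical, and `cost_details` is only assumed to be the snake_case spelling the review has in mind; the review comment does not list the exact keys.

```typescript
// Narrow an unknown value to a plain object (not null, not an array).
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Return the value of the first candidate key that holds a plain object.
function pickRecord(
  source: Record<string, unknown> | undefined,
  ...keys: string[]
): Record<string, unknown> | undefined {
  if (!source) return undefined;
  for (const key of keys) {
    const value = source[key];
    if (isRecord(value)) return value;
  }
  return undefined;
}

// Resolve openrouter -> usage -> cost details, tolerating either key casing.
function readOpenRouterCostDetails(
  providerMetadata: unknown,
): Record<string, unknown> | undefined {
  const root = isRecord(providerMetadata) ? providerMetadata : undefined;
  const openRouter = pickRecord(root, "openrouter");
  const usage = pickRecord(openRouter, "usage");
  // Assumed snake_case fallback; the actual provider key may differ.
  return pickRecord(usage, "costDetails", "cost_details");
}
```

Listing the camelCase key first keeps the current behavior unchanged when both spellings are present.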

@omeraplak omeraplak merged commit 2075bd9 into main Mar 20, 2026
22 of 24 checks passed
@omeraplak omeraplak deleted the fix/eval-scorer-cost-telemetry branch March 20, 2026 15:24