feat: add ground truth assertion support to Goal Success Rate evaluator#180

Open
ybdarrenwang wants to merge 2 commits into strands-agents:main from ybdarrenwang:feature/gt-gsr

Conversation

@ybdarrenwang
Collaborator

Description

This PR enhances the GoalSuccessRateEvaluator to support assertion-based evaluation mode alongside the existing conversation-analysis mode. The evaluator now accepts ground truth assertions via metadata and validates agent behavior against these explicit success criteria.
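To make the two evaluation modes concrete, here is a minimal standalone sketch. It is illustrative only: the real GoalSuccessRateEvaluator lives in the strands-agents evals package, and the class shapes, field names, and the `assertions` metadata key below are assumptions based on this description, not the actual API.

```python
# Illustrative sketch only -- not the strands-agents implementation.
# Class shapes and the "assertions" metadata key are assumptions.
from dataclasses import dataclass, field


@dataclass
class EvaluationData:
    conversation: list[str]                       # user/agent turns to analyze
    metadata: dict = field(default_factory=dict)  # may carry ground truth assertions


class GoalSuccessRateEvaluator:
    """Scores goal success from the conversation alone or, when ground
    truth assertions are supplied via metadata, by validating the agent's
    final response against each explicit success criterion."""

    def evaluate(self, data: EvaluationData) -> float:
        assertions = data.metadata.get("assertions")
        if assertions:
            # Assertion mode: fraction of assertions satisfied by the
            # final agent turn (a trivial substring check stands in for
            # the real validation logic).
            final = data.conversation[-1].lower()
            passed = sum(1 for a in assertions if a.lower() in final)
            return passed / len(assertions)
        # Conversation-analysis mode: an LLM judge in the real evaluator,
        # stubbed here as a trivial heuristic.
        return 1.0 if data.conversation else 0.0
```

In assertion mode the score degrades gracefully: two of four satisfied assertions yields 0.5 rather than an all-or-nothing verdict.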

Related Issues

#95

Documentation PR

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Co-authored-by: Kang Zhou <kangzhou1991@gmail.com>
Co-authored-by: Subramanian Chidambaram <subbu10123@gmail.com>
Contributor

@poshinchen poshinchen left a comment


  1. I would prefer the assertions to be named expected_criteria / expected_goal, to align with what other evaluators have and to avoid confusion with deterministic assertions.
    1.2. Add expected_criteria: str | None = None to EvaluationData and Case
  2. How useful is the additional_context?
  3. Can't the prompt be goal_success_rate_v1? What's the major difference?

@ybdarrenwang
Collaborator Author

  > 1. I would prefer the assertions to be named expected_criteria / expected_goal, to align with what other evaluators have and to avoid confusion with deterministic assertions.
  >   1.2. Add expected_criteria: str | None = None to EvaluationData and Case
  > 2. How useful is the additional_context?
  > 3. Can't the prompt be goal_success_rate_v1? What's the major difference?

  1. That makes sense. I will incorporate it in the next revision.
  2. @kangISU @subbu10123, can you answer this?
  3. I don't think so. goal_success_rate and goal_success_rate_with_assertions are fundamentally different prompts; in the future their versions will have to evolve separately.

@kangISU
Collaborator

kangISU commented Mar 27, 2026

  • How useful is the additional_context?

This is for flexibility. Customers may want to include additional context for their specific use cases or domains, to further guide the evaluation. However, it can be removed if flexibility is not a high priority here.

…ntext and unnecessary async func

Co-authored-by: Kang Zhou <kangzhou1991@gmail.com>
Co-authored-by: Subramanian Chidambaram <subbu10123@gmail.com>
