
Parallelize multi-turn session evaluation#19222

Merged
AveshCSingh merged 6 commits into mlflow:master from AveshCSingh:parallelize-multi-turn-eval
Dec 11, 2025

Conversation

@AveshCSingh
Contributor

@AveshCSingh AveshCSingh commented Dec 4, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19222/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19222/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/19222/merge

What changes are proposed in this pull request?

Updates mlflow.genai.evaluate to run multi-turn scorers in parallel, and to include their execution in the progress bar.
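The pattern can be sketched roughly as follows. This is a hypothetical illustration using stdlib `concurrent.futures`, not MLflow's actual harness code; the trace/session shapes and the `evaluate` signature are placeholders, and a plain counter stands in for the progress bar.

```python
# Hypothetical sketch of the parallelization pattern (not MLflow internals):
# per-trace ("single-turn") and per-session ("multi-turn") scorer tasks share
# one thread pool, and each completed task ticks a shared progress counter.
from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate(traces, sessions, single_turn_scorer, multi_turn_scorer, max_workers=4):
    total = len(traces) + len(sessions)  # progress bar total: traces + sessions
    done = 0
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Phase 1: one task per trace.
        futures = {pool.submit(single_turn_scorer, t): ("trace", t["id"]) for t in traces}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
            done += 1  # progress_bar.update(1)
        # Phase 2: one task per session, submitted after phase 1 because the
        # single-turn pass may create the traces that session-level
        # assessments are logged to.
        futures = {
            pool.submit(multi_turn_scorer, session_traces): ("session", session_id)
            for session_id, session_traces in sessions.items()
        }
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
            done += 1
    assert done == total
    return results
```

The two-phase structure mirrors the ordering constraint discussed in the review thread below: multi-turn tasks only enter the pool once all single-turn tasks have completed.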

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Manual validation: Ran script, got output:

$ uv run python tmp/scripts/test_multi_turn_eval.py
warning: Failed to parse `pyproject.toml` during settings discovery:
  TOML parse error at line 232, column 1
      |
  232 | exclude-dependencies = ["databricks-connect"]
      | ^^^^^^^^^^^^^^^^^^^^
  unknown field `exclude-dependencies`, expected one of `required-version`, `native-tls`, … (long list of valid `[tool.uv]` fields omitted)

======================================================================
Multi-Turn Evaluation Demo
======================================================================
2025/12/04 23:08:39 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2025/12/04 23:08:39 INFO mlflow.store.db.utils: Updating database tables
2025-12-04 23:08:39 INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
2025-12-04 23:08:39 INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
2025-12-04 23:08:39 INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
2025-12-04 23:08:39 INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
Using tracking URI: sqlite:///mlflow.db


1. Creating traces with session metadata...
----------------------------------------------------------------------

  Session 1 (3 turns):
    - Turn 1: tr-ccef92cc0cd2751089dd35a893b8c1e7
    - Turn 2: tr-785902683fb01c268406e0dc299cffe2
    - Turn 3: tr-107bd59286c1f9f34d10fec772b1a4e9

  Session 2 (2 turns):
    - Turn 1: tr-f30435eff45eb6f3399f61145fdffc5a
    - Turn 2: tr-ceeb6a34ac999a8fb44a166467365d55

2. Creating evaluation dataset from traces...
----------------------------------------------------------------------
   Created dataset with 5 traces

3. Setting up scorers...
----------------------------------------------------------------------
   Single-turn scorers:
     - ResponseLengthScorer: Measures response length
   Multi-turn scorers:
     - ConversationLengthScorer: Counts turns per session
     - AverageResponseTimeScorer: Average response time per session

4. Running evaluation...
----------------------------------------------------------------------
Evaluating:  71%|██████████████▏     | 5/7 [Elapsed: 00:00, Remaining: 00:00]
Evaluating: 100%|████████████████████| 7/7 [Elapsed: 00:05, Remaining: 00:00]

✨ Evaluation completed.

Metrics and evaluation results are logged to the MLflow run:
  Run name: merciful-mule-166
  Run ID: e10cb60391d44813b09239ef2f5008a0

To view the detailed evaluation results with sample-wise scores,
open the Traces tab in the Run page in the MLflow UI.


5. Results:
======================================================================

Aggregated Metrics:
----------------------------------------------------------------------
  response_length/mean: 31.00
  avg_response_time/mean: 0.00
  conversation_length/mean: 2.50

Per-Trace Results:
----------------------------------------------------------------------

  Multi-turn scores (one per session):
    tr-eaac9ffd5b81ca3b2... -> 3 turns, 0.0ms avg
    tr-420b8d8bed3e89b40... -> 2 turns, 0.0ms avg

  Single-turn scores (one per trace):
    tr-eaac9ffd5b81ca3b2... -> 38 chars
    tr-e790fe4eeadfe4d8a... -> 31 chars
    tr-de3e03ee4ecaecc55... -> 26 chars
    tr-420b8d8bed3e89b40... -> 29 chars
    tr-db8e31bad089fb52e... -> 31 chars

======================================================================
Demo completed successfully!
======================================================================

Key observations:
  - 5 traces total (3 in session_1, 2 in session_2)
  - 5 single-turn scores (one per trace)
  - 2 multi-turn scores (one per session)
  - Multi-turn scorers see all traces in each session

You can view detailed results in the MLflow UI:
  Run ID: e10cb60391d44813b09239ef2f5008a0
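The session grouping behind the demo numbers above can be illustrated with a minimal sketch. Plain dicts stand in for MLflow traces here; this is not the library's internal code, and the `session_id` field name is illustrative.

```python
# Minimal sketch of grouping traces by session (dicts stand in for traces).
from collections import defaultdict

def group_traces_by_session(traces):
    """Group traces by session id; traces without one are left ungrouped."""
    groups = defaultdict(list)
    for trace in traces:
        session_id = trace.get("session_id")
        if session_id:
            groups[session_id].append(trace)
    return dict(groups)

# 5 traces across 2 sessions, mirroring the demo: 3 in session_1, 2 in session_2.
traces = [{"id": f"tr-{i}", "session_id": "session_1" if i < 3 else "session_2"}
          for i in range(5)]
groups = group_traces_by_session(traces)
conversation_lengths = {sid: len(ts) for sid, ts in groups.items()}
```

With these inputs, `conversation_lengths` is `{"session_1": 3, "session_2": 2}`: one multi-turn score per session, whose mean is the `conversation_length/mean: 2.50` reported above.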

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Updates mlflow.genai.evaluate to run multi-turn scorers in parallel, and to include their execution in the progress bar.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
@github-actions
Contributor

github-actions Bot commented Dec 4, 2025

Documentation preview for bf2ac8c is available.


Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
@AveshCSingh AveshCSingh changed the title [wip] Parallelize multi-turn session evaluation Parallelize multi-turn session evaluation Dec 4, 2025
@github-actions github-actions Bot added v3.7.0 area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs. labels Dec 4, 2025
@AveshCSingh AveshCSingh force-pushed the parallelize-multi-turn-eval branch from 7472162 to 08dcc2c Compare December 4, 2025 23:13
@AveshCSingh AveshCSingh removed the v3.7.0 label Dec 4, 2025
@AveshCSingh AveshCSingh requested a review from smoorjani December 4, 2025 23:15
@AveshCSingh AveshCSingh removed the rn/none List under Small Changes in Changelogs. label Dec 4, 2025
Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
@github-actions github-actions Bot added the v3.7.0 label Dec 4, 2025
@AveshCSingh AveshCSingh added the rn/bug-fix Mention under Bug Fixes in Changelogs. label Dec 5, 2025
@AveshCSingh AveshCSingh force-pushed the parallelize-multi-turn-eval branch from e38c3d9 to f335aab Compare December 5, 2025 01:30
@github-actions github-actions Bot added v3.7.1 and removed v3.7.0 labels Dec 5, 2025
Collaborator

@smoorjani smoorjani left a comment


left a few style nits (e.g., we don't need a lot of the one-liner comments), but my biggest question is whether we can submit single-turn and multi-turn evaluation tasks at the same time, or does it have to be single-turn first?

progress_bar.update(1)

# Phase 2: Submit and complete multi-turn tasks (after single-turn)
# We run multi-turn scorers after single-turn, since single-turn scorers may create new
Collaborator


Not sure I understand this comment - I thought we expect all traces are already generated (i.e., static dataset)?

Contributor Author


That isn't the case for EvaluationDatasets, where the trace is only linked from the EvalItem. _run_single will create minimal traces in this case, and multi-turn assessments will be logged on these traces.

When you evaluate a dataset without a predict_fn, minimal traces will be created from the request/response of the source traces.

Collaborator


Not a blocker for this PR, but do you think it's worth decomposing this? Step 1: generate traces for all rows, step 2: evaluate all traces

session_groups = defaultdict(list)

for item in eval_items:
if not getattr(item, "trace", None):
Collaborator


nit: why did we change this code? it seems functionally the same but the first iteration was cleaner

Contributor Author


The key difference is here:

        if not session_id and item.source is not None:
            session_id = item.source.source_data.get("session_id")

When the eval_item is from an evaluation dataset, EvalItem.trace will be None at the time when group_traces_by_session is called (code pointer). Instead we need to look for the session ID within the source_metadata which gets populated when merging records into the dataset (code pointer).

Why didn't we need this change before this PR? Well previously we did not care about how many sessions there were since the progress bar did not track session-level scorers. Now that it does, we must count the sessions before _run_single populates EvalItem.trace with minimal traces.
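The lookup order described above might be sketched like this. Dicts stand in for EvalItem objects, and the field layout is illustrative rather than MLflow's exact schema.

```python
# Sketch of the session-id resolution order (dicts stand in for EvalItems).
def resolve_session_id(item):
    # Primary: the session id recorded on the trace (trace-based evaluation).
    session_id = None
    if item.get("trace"):
        session_id = item["trace"].get("session_id")
    # Fallback: for dataset-based evaluation, item["trace"] is still None at
    # grouping time, so read the session id from the record's source metadata,
    # which is populated when records are merged into the dataset.
    if not session_id and item.get("source") is not None:
        session_id = item["source"]["source_data"].get("session_id")
    return session_id

# Trace-based item: the trace carries the session id directly.
trace_item = {"trace": {"session_id": "session_1"}, "source": None}
# Dataset-based item: no trace yet, so fall back to source metadata.
dataset_item = {"trace": None, "source": {"source_data": {"session_id": "session_2"}}}
```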

Comment thread mlflow/genai/evaluation/session_utils.py Outdated
Comment thread mlflow/genai/evaluation/harness.py Outdated
if progress_bar:
progress_bar.close()

# Log multi-turn assessments to traces
Collaborator


nit: maybe don't need some of these one-liner comments

Contributor Author


Agreed -- I've removed a bunch of these.

@AveshCSingh
Contributor Author

left a few style nits (e.g., we don't need a lot of the one-liner comments), but my biggest question is whether we can submit single-turn and multi-turn evaluation tasks at the same time, or does it have to be single-turn first?

The reason I opted to do single-turn first and then multi-turn in this PR is that, with evaluation datasets, the single-turn pass creates minimal traces to log assessments to. Multi-turn scorers must log assessments to these same traces, so the traces must be created prior to running multi-turn evaluation.

An alternative approach here is to:
(1) Create minimal traces for all EvalItems that don't already have a trace. This should be all EvalItems for EvaluationDataset-based evaluation, and no EvalItems for trace-based evaluation.
(2) Run single-turn and multi-turn in parallel.

Given that (1) calls get_trace, it's not actually clear to me that this is going to be faster in practice. The advantage of the current approach is that we pipeline the parallelizable work so some threads are running get_trace while others evaluate scorers. That said, it's clearly suboptimal for trace-based evaluation.
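That two-phase alternative could look roughly like the sketch below. `run_single`, `score_single`, `score_session`, and `group_by_session` are placeholder callables, not MLflow APIs; the point is only the phasing.

```python
# Rough sketch of the two-phase alternative: materialize every trace first,
# then score single-turn and multi-turn tasks concurrently in the same pool.
from concurrent.futures import ThreadPoolExecutor

def evaluate_two_phase(items, run_single, score_single, score_session, group_by_session):
    with ThreadPoolExecutor() as pool:
        # Phase 1: ensure every item has a trace (all items for dataset-based
        # evaluation, none for trace-based evaluation).
        traces = list(pool.map(lambda item: item["trace"] or run_single(item), items))
        # Phase 2: all traces now exist, so both scorer kinds run concurrently.
        single_futs = [pool.submit(score_single, t) for t in traces]
        multi_futs = [pool.submit(score_session, ts)
                      for ts in group_by_session(traces).values()]
        return [f.result() for f in single_futs], [f.result() for f in multi_futs]
```

For trace-based evaluation, phase 1 is a no-op and everything runs concurrently; the trade-off, as noted above, is losing the pipelining of trace creation with scorer execution in the dataset-based case.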

@AveshCSingh AveshCSingh requested a review from smoorjani December 6, 2025 00:51
@smoorjani
Collaborator

The reason I opted to do single-turn first and then multi-turn in this PR is that, with evaluation datasets, the single-turn pass creates minimal traces to log assessments to. Multi-turn scorers must log assessments to these same traces, so the traces must be created prior to running multi-turn evaluation.

An alternative approach here is to: (1) Create minimal traces for all EvalItems that don't already have a trace. This should be all EvalItems for EvaluationDataset-based evaluation, and no EvalItems for trace-based evaluation. (2) Run single-turn and multi-turn in parallel.

Given that (1) calls get_trace, it's not actually clear to me that this is going to be faster in practice. The advantage of the current approach is that we pipeline the parallelizable work so some threads are running get_trace while others evaluate scorers. That said, it's clearly suboptimal for trace-based evaluation.

Not a blocker for this PR, but can you file a ticket for the alternative approach? IMO it's confusing to think of evaluation datasets separately from the other case (there are two cases in this PR where we had to add some code complexity). It seems much cleaner conceptually to generate all traces first and then start evaluating the traces.

@AveshCSingh
Contributor Author

Yes I see your point. I've filed https://databricks.atlassian.net/browse/ML-60284.

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
@AveshCSingh
Contributor Author

@B-Step62 , should I be able to merge this PR myself now that #19241 is merged? I still see 2 failing checks:

  • Maintainer approval
  • Protect

and do not see a merge button.

@AveshCSingh AveshCSingh force-pushed the parallelize-multi-turn-eval branch from c2ada07 to bf2ac8c Compare December 8, 2025 17:34
@AveshCSingh AveshCSingh added this pull request to the merge queue Dec 11, 2025
Merged via the queue into mlflow:master with commit d496198 Dec 11, 2025
65 of 67 checks passed
@AveshCSingh AveshCSingh deleted the parallelize-multi-turn-eval branch December 11, 2025 16:33

Labels

area/evaluation MLflow Evaluation rn/bug-fix Mention under Bug Fixes in Changelogs. v3.7.1
