
Parallelize multi-turn session evaluation#19222

Merged
AveshCSingh merged 6 commits into mlflow:master from AveshCSingh:parallelize-multi-turn-eval
Dec 11, 2025

Conversation

@AveshCSingh
Contributor

@AveshCSingh AveshCSingh commented Dec 4, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19222/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19222/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/19222/merge

What changes are proposed in this pull request?

Updates mlflow.genai.evaluate to run multi-turn scorers in parallel, and to include their execution in the progress bar.
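The pattern can be sketched roughly as follows. This is a hypothetical illustration using stdlib `concurrent.futures`, not MLflow's actual harness code; the trace/session shapes and the `evaluate` signature are placeholders, and a plain counter stands in for the progress bar.

```python
# Hypothetical sketch of the parallelization pattern (not MLflow internals):
# per-trace ("single-turn") and per-session ("multi-turn") scorer tasks share
# one thread pool, and each completed task ticks a shared progress counter.
from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate(traces, sessions, single_turn_scorer, multi_turn_scorer, max_workers=4):
    total = len(traces) + len(sessions)  # progress bar total: traces + sessions
    done = 0
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Phase 1: one task per trace.
        futures = {pool.submit(single_turn_scorer, t): ("trace", t["id"]) for t in traces}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
            done += 1  # progress_bar.update(1)
        # Phase 2: one task per session, submitted after phase 1 because the
        # single-turn pass may create the traces that session-level
        # assessments are logged to.
        futures = {
            pool.submit(multi_turn_scorer, session_traces): ("session", session_id)
            for session_id, session_traces in sessions.items()
        }
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
            done += 1
    assert done == total
    return results
```

The two-phase structure mirrors the ordering constraint discussed in the review thread below: multi-turn tasks only enter the pool once all single-turn tasks have completed.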

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Manual validation: Ran script, got output:

$ uv run python tmp/scripts/test_multi_turn_eval.py
warning: Failed to parse `pyproject.toml` during settings discovery:
  TOML parse error at line 232, column 1
      |
  232 | exclude-dependencies = ["databricks-connect"]
      | ^^^^^^^^^^^^^^^^^^^^
  unknown field `exclude-dependencies`, expected one of `required-version`, `native-tls`, … (long list of valid `[tool.uv]` fields omitted)

======================================================================
Multi-Turn Evaluation Demo
======================================================================
2025/12/04 23:08:39 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2025/12/04 23:08:39 INFO mlflow.store.db.utils: Updating database tables
2025-12-04 23:08:39 INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
2025-12-04 23:08:39 INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
2025-12-04 23:08:39 INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
2025-12-04 23:08:39 INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
Using tracking URI: sqlite:///mlflow.db


1. Creating traces with session metadata...
----------------------------------------------------------------------

  Session 1 (3 turns):
    - Turn 1: tr-ccef92cc0cd2751089dd35a893b8c1e7
    - Turn 2: tr-785902683fb01c268406e0dc299cffe2
    - Turn 3: tr-107bd59286c1f9f34d10fec772b1a4e9

  Session 2 (2 turns):
    - Turn 1: tr-f30435eff45eb6f3399f61145fdffc5a
    - Turn 2: tr-ceeb6a34ac999a8fb44a166467365d55

2. Creating evaluation dataset from traces...
----------------------------------------------------------------------
   Created dataset with 5 traces

3. Setting up scorers...
----------------------------------------------------------------------
   Single-turn scorers:
     - ResponseLengthScorer: Measures response length
   Multi-turn scorers:
     - ConversationLengthScorer: Counts turns per session
     - AverageResponseTimeScorer: Average response time per session

4. Running evaluation...
----------------------------------------------------------------------
Evaluating:  71%|██████████████▏     | 5/7 [Elapsed: 00:00, Remaining: 00:00]
Evaluating: 100%|████████████████████| 7/7 [Elapsed: 00:05, Remaining: 00:00]

✨ Evaluation completed.

Metrics and evaluation results are logged to the MLflow run:
  Run name: merciful-mule-166
  Run ID: e10cb60391d44813b09239ef2f5008a0

To view the detailed evaluation results with sample-wise scores,
open the Traces tab in the Run page in the MLflow UI.


5. Results:
======================================================================

Aggregated Metrics:
----------------------------------------------------------------------
  response_length/mean: 31.00
  avg_response_time/mean: 0.00
  conversation_length/mean: 2.50

Per-Trace Results:
----------------------------------------------------------------------

  Multi-turn scores (one per session):
    tr-eaac9ffd5b81ca3b2... -> 3 turns, 0.0ms avg
    tr-420b8d8bed3e89b40... -> 2 turns, 0.0ms avg

  Single-turn scores (one per trace):
    tr-eaac9ffd5b81ca3b2... -> 38 chars
    tr-e790fe4eeadfe4d8a... -> 31 chars
    tr-de3e03ee4ecaecc55... -> 26 chars
    tr-420b8d8bed3e89b40... -> 29 chars
    tr-db8e31bad089fb52e... -> 31 chars

======================================================================
Demo completed successfully!
======================================================================

Key observations:
  - 5 traces total (3 in session_1, 2 in session_2)
  - 5 single-turn scores (one per trace)
  - 2 multi-turn scores (one per session)
  - Multi-turn scorers see all traces in each session

You can view detailed results in the MLflow UI:
  Run ID: e10cb60391d44813b09239ef2f5008a0
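The session grouping behind the demo numbers above can be illustrated with a minimal sketch. Plain dicts stand in for MLflow traces here; this is not the library's internal code, and the `session_id` field name is illustrative.

```python
# Minimal sketch of grouping traces by session (dicts stand in for traces).
from collections import defaultdict

def group_traces_by_session(traces):
    """Group traces by session id; traces without one are left ungrouped."""
    groups = defaultdict(list)
    for trace in traces:
        session_id = trace.get("session_id")
        if session_id:
            groups[session_id].append(trace)
    return dict(groups)

# 5 traces across 2 sessions, mirroring the demo: 3 in session_1, 2 in session_2.
traces = [{"id": f"tr-{i}", "session_id": "session_1" if i < 3 else "session_2"}
          for i in range(5)]
groups = group_traces_by_session(traces)
conversation_lengths = {sid: len(ts) for sid, ts in groups.items()}
```

With these inputs, `conversation_lengths` is `{"session_1": 3, "session_2": 2}`: one multi-turn score per session, whose mean is the `conversation_length/mean: 2.50` reported above.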

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Updates mlflow.genai.evaluate to run multi-turn scorers in parallel, and to include their execution in the progress bar.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
@github-actions
Contributor

github-actions Bot commented Dec 4, 2025

Documentation preview for bf2ac8c is available.


Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
@AveshCSingh AveshCSingh changed the title [wip] Parallelize multi-turn session evaluation Parallelize multi-turn session evaluation Dec 4, 2025
@github-actions github-actions Bot added v3.7.0 area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs. labels Dec 4, 2025
@AveshCSingh AveshCSingh force-pushed the parallelize-multi-turn-eval branch from 7472162 to 08dcc2c Compare December 4, 2025 23:13
@AveshCSingh AveshCSingh removed the v3.7.0 label Dec 4, 2025
@AveshCSingh AveshCSingh requested a review from smoorjani December 4, 2025 23:15
@AveshCSingh AveshCSingh removed the rn/none List under Small Changes in Changelogs. label Dec 4, 2025
Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
@github-actions github-actions Bot added the v3.7.0 label Dec 4, 2025
@AveshCSingh AveshCSingh added the rn/bug-fix Mention under Bug Fixes in Changelogs. label Dec 5, 2025
@AveshCSingh AveshCSingh force-pushed the parallelize-multi-turn-eval branch from e38c3d9 to f335aab Compare December 5, 2025 01:30
@github-actions github-actions Bot added v3.7.1 and removed v3.7.0 labels Dec 5, 2025
Collaborator

@smoorjani smoorjani left a comment


left a few style nits (e.g., we don't need a lot of the one-liner comments), but my biggest question is whether we can submit single-turn and multi-turn evaluation tasks at the same time, or does it have to be single-turn first?

progress_bar.update(1)

# Phase 2: Submit and complete multi-turn tasks (after single-turn)
# We run multi-turn scorers after single-turn, since single-turn scorers may create new
Collaborator


Not sure I understand this comment - I thought we expect all traces are already generated (i.e., static dataset)?

Contributor Author


That isn't the case for EvaluationDatasets, where the trace is only linked from the EvalItem. _run_single will create minimal traces in this case, and multi-turn assessments will be logged on these traces.

When you evaluate a dataset without a predict_fn, minimal traces will be created from the request/response of the source traces.

Collaborator


Not a blocker for this PR, but do you think it's worth decomposing this? Step 1: generate traces for all rows, step 2: evaluate all traces

session_groups = defaultdict(list)

for item in eval_items:
if not getattr(item, "trace", None):
Collaborator


nit: why did we change this code? it seems functionally the same but the first iteration was cleaner

Contributor Author


The key difference is here:

        if not session_id and item.source is not None:
            session_id = item.source.source_data.get("session_id")

When the eval_item is from an evaluation dataset, EvalItem.trace will be None at the time when group_traces_by_session is called (code pointer). Instead we need to look for the session ID within the source_metadata which gets populated when merging records into the dataset (code pointer).

Why didn't we need this change before this PR? Well previously we did not care about how many sessions there were since the progress bar did not track session-level scorers. Now that it does, we must count the sessions before _run_single populates EvalItem.trace with minimal traces.
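The lookup order described above might be sketched like this. Dicts stand in for EvalItem objects, and the field layout is illustrative rather than MLflow's exact schema.

```python
# Sketch of the session-id resolution order (dicts stand in for EvalItems).
def resolve_session_id(item):
    # Primary: the session id recorded on the trace (trace-based evaluation).
    session_id = None
    if item.get("trace"):
        session_id = item["trace"].get("session_id")
    # Fallback: for dataset-based evaluation, item["trace"] is still None at
    # grouping time, so read the session id from the record's source metadata,
    # which is populated when records are merged into the dataset.
    if not session_id and item.get("source") is not None:
        session_id = item["source"]["source_data"].get("session_id")
    return session_id

# Trace-based item: the trace carries the session id directly.
trace_item = {"trace": {"session_id": "session_1"}, "source": None}
# Dataset-based item: no trace yet, so fall back to source metadata.
dataset_item = {"trace": None, "source": {"source_data": {"session_id": "session_2"}}}
```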

Comment thread mlflow/genai/evaluation/session_utils.py Outdated
Comment thread mlflow/genai/evaluation/harness.py Outdated
if progress_bar:
progress_bar.close()

# Log multi-turn assessments to traces
Collaborator


nit: maybe don't need some of these one-liner comments

Contributor Author


Agreed -- I've removed a bunch of these.

@AveshCSingh
Contributor Author

left a few style nits (e.g., we don't need a lot of the one-liner comments), but my biggest question is whether we can submit single-turn and multi-turn evaluation tasks at the same time, or does it have to be single-turn first?

The reason I opted to do single-turn first and then multi-turn in this PR is that, with evaluation datasets, the single-turn pass creates minimal traces to log assessments to. Multi-turn scorers must log assessments to these same traces, so the traces must be created prior to running multi-turn evaluation.

An alternative approach here is to:
(1) Create minimal traces for all EvalItems that don't already have a trace. This should be all EvalItems for EvaluationDataset-based evaluation, and no EvalItems for trace-based evaluation.
(2) Run single-turn and multi-turn in parallel.

Given that (1) calls get_trace, it's not actually clear to me that this is going to be faster in practice. The advantage of the current approach is that we pipeline the parallelizable work so some threads are running get_trace while others evaluate scorers. That said, it's clearly suboptimal for trace-based evaluation.
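That two-phase alternative could look roughly like the sketch below. `run_single`, `score_single`, `score_session`, and `group_by_session` are placeholder callables, not MLflow APIs; the point is only the phasing.

```python
# Rough sketch of the two-phase alternative: materialize every trace first,
# then score single-turn and multi-turn tasks concurrently in the same pool.
from concurrent.futures import ThreadPoolExecutor

def evaluate_two_phase(items, run_single, score_single, score_session, group_by_session):
    with ThreadPoolExecutor() as pool:
        # Phase 1: ensure every item has a trace (all items for dataset-based
        # evaluation, none for trace-based evaluation).
        traces = list(pool.map(lambda item: item["trace"] or run_single(item), items))
        # Phase 2: all traces now exist, so both scorer kinds run concurrently.
        single_futs = [pool.submit(score_single, t) for t in traces]
        multi_futs = [pool.submit(score_session, ts)
                      for ts in group_by_session(traces).values()]
        return [f.result() for f in single_futs], [f.result() for f in multi_futs]
```

For trace-based evaluation, phase 1 is a no-op and everything runs concurrently; the trade-off, as noted above, is losing the pipelining of trace creation with scorer execution in the dataset-based case.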

@AveshCSingh AveshCSingh requested a review from smoorjani December 6, 2025 00:51
@smoorjani
Collaborator

The reason I opted to do single-turn first and then multi-turn in this PR is that, with evaluation datasets, the single-turn pass creates minimal traces to log assessments to. Multi-turn scorers must log assessments to these same traces, so the traces must be created prior to running multi-turn evaluation.

An alternative approach here is to: (1) Create minimal traces for all EvalItems that don't already have a trace. This should be all EvalItems for EvaluationDataset-based evaluation, and no EvalItems for trace-based evaluation. (2) Run single-turn and multi-turn in parallel.

Given that (1) calls get_trace, it's not actually clear to me that this is going to be faster in practice. The advantage of the current approach is that we pipeline the parallelizable work so some threads are running get_trace while others evaluate scorers. That said, it's clearly suboptimal for trace-based evaluation.

Not a blocker for this PR, but can you file a ticket for the alternative approach? IMO it's confusing to think of evaluation datasets separately from the other case (there are two cases in this PR where we had to add some code complexity). It seems much cleaner conceptually to generate all traces first and then start evaluating the traces.

@AveshCSingh
Contributor Author

Yes I see your point. I've filed https://databricks.atlassian.net/browse/ML-60284.

Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
@AveshCSingh
Contributor Author

@B-Step62 , should I be able to merge this PR myself now that #19241 is merged? I still see 2 failing checks:

  • Maintainer approval
  • Protect

and do not see a merge button.

@AveshCSingh AveshCSingh force-pushed the parallelize-multi-turn-eval branch from c2ada07 to bf2ac8c Compare December 8, 2025 17:34
@AveshCSingh AveshCSingh added this pull request to the merge queue Dec 11, 2025
Merged via the queue into mlflow:master with commit d496198 Dec 11, 2025
65 of 67 checks passed
@AveshCSingh AveshCSingh deleted the parallelize-multi-turn-eval branch December 11, 2025 16:33

Labels

area/evaluation MLflow Evaluation rn/bug-fix Mention under Bug Fixes in Changelogs. v3.7.1
