Parallelize multi-turn session evaluation #19222
Conversation
Signed-off-by: Avesh Singh <aveshcsingh@gmail.com>
Force-pushed from 7472162 to 08dcc2c
Force-pushed from e38c3d9 to f335aab
smoorjani left a comment:
Left a few style nits (e.g., we don't need a lot of the one-liner comments), but my biggest question is whether we can submit single-turn and multi-turn evaluation tasks at the same time, or does it have to be single-turn first?
    progress_bar.update(1)

    # Phase 2: Submit and complete multi-turn tasks (after single-turn)
    # We run multi-turn scorers after single-turn, since single-turn scorers may create new
Not sure I understand this comment - I thought we expect all traces are already generated (i.e., static dataset)?
That isn't the case for EvaluationDatasets, where the trace is only linked from the EvalItem. _run_single will create minimal traces in this case, and multi-turn assessments will be logged on these traces.
When you evaluate a dataset without a predict_fn, minimal traces will be created from the request/response of the source traces.
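A rough sketch of the minimal-trace fallback described above (the function and the dict field names here are illustrative assumptions, not MLflow's actual internals):

```python
def build_minimal_trace(eval_item):
    """Create a bare-bones trace for an EvalItem that has no live trace,
    using the request/response captured with its source record.
    Illustrative sketch only: field names are assumptions, not MLflow internals.
    """
    if eval_item.get("trace") is not None:
        return eval_item["trace"]  # a real trace is already linked
    source = eval_item.get("source") or {}
    # Fall back to the request/response stored in the dataset record so
    # single-turn and multi-turn scorers have a trace to log assessments to.
    return {
        "request": source.get("request"),
        "response": source.get("response"),
        "assessments": [],  # scorers append their assessments here
    }
```

The key point is that multi-turn scorers must see the same trace objects, so this fallback has to run before any session-level evaluation starts.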
Not a blocker for this PR, but do you think it's worth decomposing this? Step 1: generate traces for all rows; step 2: evaluate all traces.
    session_groups = defaultdict(list)

    for item in eval_items:
        if not getattr(item, "trace", None):
nit: why did we change this code? it seems functionally the same but the first iteration was cleaner
The key difference is here:

    if not session_id and item.source is not None:
        session_id = item.source.source_data.get("session_id")

When the eval_item is from an evaluation dataset, EvalItem.trace will be None at the time group_traces_by_session is called (code pointer). Instead, we need to look for the session ID within the source_metadata, which gets populated when merging records into the dataset (code pointer).
Why didn't we need this change before this PR? Previously we did not care how many sessions there were, since the progress bar did not track session-level scorers. Now that it does, we must count the sessions before _run_single populates EvalItem.trace with minimal traces.
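The grouping logic with the dataset fallback might look roughly like this (a sketch; attribute and key names are assumptions about the real group_traces_by_session, not its actual code):

```python
from collections import defaultdict

def group_traces_by_session(eval_items):
    """Group eval items by session ID so the progress bar can count
    session-level (multi-turn) scorer tasks up front.
    Sketch only: attribute and key names are assumptions."""
    session_groups = defaultdict(list)
    for item in eval_items:
        session_id = None
        trace = getattr(item, "trace", None)
        if trace is not None:
            session_id = trace.get("session_id")
        # Dataset rows have no trace yet at this point, so fall back to
        # the session ID recorded in the item's source metadata.
        if not session_id and getattr(item, "source", None) is not None:
            session_id = item.source.source_data.get("session_id")
        if session_id:
            session_groups[session_id].append(item)
    return session_groups
```

The fallback branch is what makes the session count correct for evaluation-dataset rows, whose traces are only created later by _run_single.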
    if progress_bar:
        progress_bar.close()

    # Log multi-turn assessments to traces
nit: maybe don't need some of these one-liner comments
Agreed -- I've removed a bunch of these.
The reason I opted to do single-turn first and then multi-turn in this PR is that, with evaluation datasets, the single-turn pass creates minimal traces to log assessments to. Multi-turn scorers must log assessments to these same traces, so the traces must be created prior to running multi-turn evaluation. An alternative approach would be to generate all traces up front and then run both scorer passes against them.
Not a blocker for this PR, but can you file a ticket for the alternative approach? IMO it's confusing to think of evaluation datasets separately from the other case (there are two places in this PR where we had to add some code complexity). It seems much cleaner conceptually to generate all traces first and then start evaluating the traces.
Yes, I see your point. I've filed https://databricks.atlassian.net/browse/ML-60284.
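The two-phase schedule discussed in this thread could be sketched as follows (hypothetical helper and interfaces; the real implementation lives inside mlflow.genai.evaluate and is not reproduced here):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def group_items_by_session(eval_items):
    # Hypothetical helper: bucket items by a session_id key.
    groups = defaultdict(list)
    for item in eval_items:
        groups[item["session_id"]].append(item)
    return groups

def run_evaluation(eval_items, single_turn_scorers, multi_turn_scorers, max_workers=4):
    """Two-phase schedule: single-turn scorers run first (creating minimal
    traces for dataset rows as a side effect), then multi-turn scorers run
    in parallel, one task per session. Sketch only; the scorer and item
    interfaces here are assumptions."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Phase 1: one task per (item, scorer) pair.
        phase1 = [pool.submit(scorer, item)
                  for item in eval_items for scorer in single_turn_scorers]
        # Blocking on results here ensures all minimal traces exist
        # before any multi-turn task is submitted.
        results.extend(f.result() for f in phase1)

        # Phase 2: one task per (session, scorer) pair, run in parallel.
        sessions = group_items_by_session(eval_items)
        phase2 = [pool.submit(scorer, items)
                  for items in sessions.values() for scorer in multi_turn_scorers]
        results.extend(f.result() for f in phase2)
    return results
```

This also shows why a single progress bar can cover both phases: the total task count (items × single-turn scorers + sessions × multi-turn scorers) is known before any work is submitted.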
Force-pushed from c2ada07 to bf2ac8c
What changes are proposed in this pull request?
Updates mlflow.genai.evaluate to run multi-turn scorers in parallel, and to include their execution in the progress bar.

How is this PR tested?
Manual validation: Ran script, got output:
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
Updates mlflow.genai.evaluate to run multi-turn scorers in parallel, and to include their execution in the progress bar.

What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/tracking: Tracking Service, tracking client APIs, autologging
area/models: MLmodel format, model serialization/deserialization, flavors
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/scoring: MLflow Model server, model deployment tools, Spark UDFs
area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
area/projects: MLproject format, project running backends
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/build: Build and test infrastructure for MLflow
area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.