fix: prevent broadcasting errors in r2_score using da.where() by wietzesuijker · Pull Request #1013 · dask/dask-ml

wietzesuijker · 2025-03-03T20:17:13Z

Closes #1012

First PR here. Curious to hear your feedback.

Problem
After updating to Dask 2025.2.0, tests fail with a ValueError ("cannot broadcast shape (nan,) to shape (nan,)") caused by changes in how Dask handles chunk sizes.

Solution
Refactor r2_score() to use da.where() for correct broadcasting.

Testing
Test added to ensure r2_score() works correctly with arrays that have different chunk configurations.
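
To make the Solution above concrete, here is a minimal sketch of an R² computation that selects per-output scores with da.where() instead of indexing with boolean masks. It is not the merged code from dask_ml/metrics/regression.py: the function name r2_score_sketch, the restriction to 2-D inputs, and the sample data are assumptions made for illustration, and sample weights are left out.

```python
import dask.array as da
import numpy as np


def r2_score_sketch(y_true, y_pred):
    """Mean R^2 over output columns for 2-D dask arrays (illustrative only)."""
    # Per-column residual and total sums of squares; both have shape (n_outputs,).
    numerator = ((y_true - y_pred) ** 2).sum(axis=0, dtype="f8")
    denominator = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0, dtype="f8")

    nonzero_num = numerator != 0
    nonzero_den = denominator != 0

    # Swap zero denominators for 1.0 before dividing so no inf/nan is produced.
    safe_den = da.where(nonzero_den, denominator, 1.0)

    # Elementwise selection keeps the shape and chunking of its inputs,
    # whereas indexing with a boolean mask yields unknown chunk sizes.
    scores = da.where(
        nonzero_num & nonzero_den,
        1.0 - numerator / safe_den,       # regular case
        da.where(nonzero_num, 0.0, 1.0),  # zero denominator or zero residual
    )
    return scores.mean()


# Hypothetical data: the two arrays are deliberately chunked differently.
y_true = da.from_array(
    np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]]), chunks=(2, 2)
)
y_pred = da.from_array(
    np.array([[1.1, 5.0], [1.9, 5.0], [3.2, 5.0], [3.8, 6.0]]), chunks=(3, 2)
)
score = r2_score_sketch(y_true, y_pred)  # still lazy until .compute()
print(float(score.compute()))
```

The second column has a constant target, so its denominator is zero and its score falls through to the inner da.where() branch.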

dask_ml/ensemble/_blockwise.py

TomAugspurger · 2025-03-27T18:16:10Z

Thanks.

I'm not entirely sure what the best action is, but I think we ought to avoid anything that triggers computation unnecessarily, including len.

Can you say a bit more about why getting n_samples is needed in blockwise?

wietzesuijker · 2025-03-29T21:35:45Z

Thanks @TomAugspurger. n_samples (now obtained via X.shape[0]) lets us determine the rechunking size without forcing computation. It splits the test data into one block per trained estimator, ensuring alignment and preventing broadcast errors. Combined with the da.where() update in r2_score, these changes maintain laziness and correct behavior with mismatched chunks.
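
As a small illustration of the laziness point, the sketch below shows that reading .shape on a Dask array is a metadata lookup, and that the value is nan when chunk sizes are unknown, which is where the "(nan,)" in the error message comes from. The array contents and variable names are made up for the example.

```python
import math

import dask.array as da

x = da.arange(10, chunks=5)
known_rows = x.shape[0]  # plain int read from metadata, no computation

# Boolean masking produces a result whose size is unknown until computed.
filtered = x[x > 3]
unknown_before = math.isnan(filtered.shape[0])

# Resolving the sizes is possible, but this step does run computation.
resolved = filtered.compute_chunk_sizes()

print(known_rows, unknown_before, resolved.shape)
```

Code that relies on X.shape[0] therefore stays lazy, but it needs to handle the nan case when the input has unknown chunk sizes.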

TomAugspurger · 2025-03-30T15:35:12Z

I'm probably missing something, but why do we care that the size of the test dataset matches the size of the training dataset (_n_samples)? I'd expect us to just care that the number of samples in X_train and y_train match, and separately that the number of samples in X_test and y_test match.

wietzesuijker · 2025-03-30T17:51:12Z

> why do we care that the size of the test dataset matches the size of the training dataset (_n_samples)?

The goal is not for the test dataset to match the training dataset's overall size. The goal is for each estimator, which was trained on one block of data, to receive a matching block of the test set. X.shape[0] (as n_samples) is used to compute the test block size so that the test set is split into as many blocks as there are estimators. This aligns the predictions and prevents broadcast errors regardless of the training and test dataset sizes.
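
The rechunking idea described here could be sketched roughly as follows. This is not the code that was proposed for dask_ml/ensemble/_blockwise.py; the helper names even_row_chunks and rechunk_rows are hypothetical, and n_blocks stands in for the number of fitted estimators.

```python
import math

import dask.array as da


def even_row_chunks(n_samples, n_blocks):
    """Split n_samples rows into exactly n_blocks nearly equal chunk sizes."""
    if n_blocks < 1 or n_samples < n_blocks:
        raise ValueError("need at least one row per block")
    base, extra = divmod(n_samples, n_blocks)
    # The first `extra` blocks take one additional row each.
    return (base + 1,) * extra + (base,) * (n_blocks - extra)


def rechunk_rows(X, n_blocks):
    """Rechunk X along axis 0 into n_blocks blocks without computing data."""
    n_samples = X.shape[0]  # metadata only; nan if chunk sizes are unknown
    if math.isnan(n_samples):
        raise ValueError("row count is unknown; chunk sizes must be known")
    return X.rechunk({0: even_row_chunks(n_samples, n_blocks)})


X_test = da.ones((10, 2), chunks=(5, 2))
aligned = rechunk_rows(X_test, n_blocks=3)
print(aligned.chunks[0])
```

Building the chunk tuple explicitly, rather than passing a single block size, guarantees the requested number of blocks even when the row count does not divide evenly.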

Resolves "cannot broadcast shape (nan,) to shape (nan,)" errors.

wietzesuijker · 2025-05-05T14:51:44Z

@TomAugspurger I have limited this PR to the changes in dask_ml/metrics/regression.py. Merging it would unblock my use case and let me update Dask. Thanks! :) (previous state)

wietzesuijker · 2025-05-06T21:56:46Z

Thanks for triggering the tests, Tom. Is there anything I can/should do to fix the failing runs? The errors seem unrelated and similar to other recent runs e.g. https://github.com/dask/dask-ml/actions/runs/14849752234/job/41691028477.

TomAugspurger · 2025-05-07T00:22:21Z

I'll take a look in the next few days.

TomAugspurger · 2025-05-10T13:23:26Z

Thanks @wietzesuijker. There's still one error in tests/preprocessing/test_data.py::TestQuantileTransformer::test_fit_transform_frame but the rest from #1012 are fixed. I'll fix that in another PR.

TomAugspurger reviewed Mar 4, 2025

dask_ml/ensemble/_blockwise.py (outdated)

wietzesuijker force-pushed the fix/broadcast-shape-nan branch from d7f7b86 to 17d3a02 Compare March 23, 2025 15:40

wietzesuijker changed the title ~~fix(ensemble, metrics): compute chunk sizes and refactor r2_score wit…~~ prevent broadcasting errors with unknown chunk sizes Mar 23, 2025

wietzesuijker requested a review from TomAugspurger March 27, 2025 14:35

wietzesuijker force-pushed the fix/broadcast-shape-nan branch from 17d3a02 to e98c538 Compare March 29, 2025 21:25

wietzesuijker changed the title ~~prevent broadcasting errors with unknown chunk sizes~~ fix: broadcast errors using lazy n_samples and da.where in r2_score Mar 29, 2025

fix: r2_score() to use da.where() for broadcasting

9c984a0

Resolves "cannot broadcast shape (nan,) to shape (nan,)" errors.

wietzesuijker force-pushed the fix/broadcast-shape-nan branch from e98c538 to 9c984a0 Compare May 5, 2025 14:42

wietzesuijker changed the title ~~fix: broadcast errors using lazy n_samples and da.where in r2_score~~ fix: prevent broadcasting errors in r2_score using da.where() May 5, 2025

TomAugspurger merged commit 6fdd1f4 into dask:main May 10, 2025
4 of 9 checks passed
