CI failing for linting issues
TomAugspurger
left a comment
Thanks. A few questions / comments.
      avg = X.mean(axis=0).values
  elif self.strategy == "median":
-     avg = X.quantile().values
+     avg = [np.median(X[col].dropna()) for col in X.columns]
I believe this will compute the values eagerly because of np.median. Since it runs inside a list comprehension, we'd end up executing the graph for X once per column. We want to delay computation until the end.
I also think this will end up pulling all the data for a column into a single ndarray to compute the median, which we also want to avoid.
How about using delayed here?
avg = [dask.delayed(np.median)(X[col].dropna()) for col in X.columns]

  for col in X.columns:
      val_counts = X[col].value_counts().reset_index()
      if isinstance(X, dd.DataFrame):
          x = val_counts.to_dask_array(lengths=True)
Do we need lengths here? This also triggers a computation.
This is needed to compute the chunk sizes. Any suggestions on how to avoid it? Thanks.
Fix #787.
Also related to #779