    return self._drop_vars(labels, errors=errors)
else:
    if utils.is_scalar(labels):
I don't think I fully understand the use case where dims is not None, where you are apparently allowed to feed a whole DataArray into labels. It also makes me nervous that there's only a single unit test triggering this use case, and it doesn't sit together with the other tests for Dataset.drop.
Isn't this for the case you pass a single label; what's the case with passing a whole DataArray?
In [12]: da = xr.DataArray(data=np.ones((2, 3)), dims=['x', 'y'], coords=dict(x=list('ab')))
In [13]: da
Out[13]:
<xarray.DataArray (x: 2, y: 3)>
array([[1., 1., 1.],
[1., 1., 1.]])
Coordinates:
* x (x) <U1 'a' 'b'
Dimensions without coordinates: y
In [14]: da.drop('a', dim='x')
Out[14]:
<xarray.DataArray (x: 1, y: 3)>
array([[1., 1., 1.]])
Coordinates:
* x (x) <U1 'b'
Dimensions without coordinates: y
This is the problem. The test is expecting ds.y to be cast to a generic array-like by drop.
xarray/xarray/tests/test_dataset.py
Line 4245 in dba85bd
The fact is that iter(DataArray) yields scalar DataArrays, not the elements of the DataArray (unlike iter(numpy.ndarray))
So the lines

if isinstance(labels, str) or not isinstance(labels, Iterable):
    labels = {labels}
else:
    labels = set(labels)

fall apart when labels is a DataArray.
To make it work I should change it to something like:

if isinstance(labels, str) or not isinstance(labels, Iterable):
    labels = {labels}
elif isinstance(labels, DataArray):
    labels = labels.values
else:
    labels = set(labels)

The problem is that the logic above is EVERYWHERE. None of the many, many functions that expect "one or more dimension names" as an input cope with (or are tested against) a DataArray input.
I lean towards removing the option entirely, and changing the unit test from

ds.drop(ds.y, dim='y')

to

ds.drop(ds.y.values, dim='y')

This is also because passing a DataArray is incredibly counter-intuitive; it took me a long time to figure out that ds.drop(ds.y, dim='y') is not trying to drop the y variable.
Note that I wrote labels = labels.values and not labels = set(labels.values). That would cause a huge performance hit when values has many elements.
I think the problem is due to the fact that the same method drop() performs two fundamentally different things - drop variables or elements from the variables - and in the latter case there are plenty of similar methods that DO accept a DataArray, e.g. sel(), where(), etc.
I think I cracked it - please review
Yes, it would be ideal to have separate functions for dropping variables rather than elements along an array dimension.
We are also considering refactoring this function over in #3128, I would be very interested in your thoughts!
Like a machine!

Living up to the headshot; a rampage of typing... Thanks @crusaderky !

Hello @crusaderky! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-08-05 22:11:48 UTC

Ready for review and merge
max-sixty
left a comment
Great! Thanks for being so thorough.
I added some small questions, e.g. around is_scalar. I also added a couple of comments we can elide for now.
Assuming we're good on is_scalar, I'll go ahead and merge.
# Don't cast to set, as it would harm performance when labels
# is a large numpy array
if utils.is_scalar(labels):
    labels = [labels]
Minor, no need to change now: could we remove the identical operation from the DataArray wrapper of this at https://github.com/pydata/xarray/pull/3177/files#diff-ffd3597671590bab245b3193affa62b8R1798 ?
raise ValueError('errors must be either "raise" or "ignore"')

if utils.is_scalar(drop_dims):
if isinstance(drop_dims, str) or not isinstance(drop_dims, Iterable):
What's the difference between this and is_scalar? I had thought is_scalar was attempting to centralize this logic in a single function
is_scalar doesn't cope with non-string Hashables, e.g. enum.Enum. I'm uncomfortable touching it without going through the exercise of thoroughly reviewing everything that invokes it.
Actually, rereading it, it does cope with Enum. However, most functions that today use if isinstance(drop_dims, str) or not isinstance(drop_dims, Iterable): won't transparently cope with 0-dimensional numpy arrays or, even worse, 0-dimensional DataArrays.
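A quick illustration of the gap (assuming numpy is available): a 0-dimensional ndarray passes the `isinstance(..., Iterable)` check, because ndarray defines `__iter__`, yet actually iterating it raises at runtime, so the isinstance-based branching would misclassify it.

```python
import numpy as np
from collections.abc import Iterable

zero_d = np.array(5)  # a 0-dimensional array

# The isinstance-based check classifies it as iterable...
is_iterable = isinstance(zero_d, Iterable)  # True

# ...but iterating a 0-d array raises TypeError at runtime.
try:
    iter(zero_d)
    iteration_failed = False
except TypeError:
    iteration_failed = True
```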
...

# Drop index labels along dimension
@overload  # noqa: F811
Great! You're leading our understanding of python typing here...
return self._drop_vars(drop_vars)

def transpose(self, *dims):
def transpose(self, *dims: Hashable) -> 'Dataset':
One day (don't change now), we could have this as returning its class, for those who inherit from Dataset. And I'm not sure how to do that with python typing; in other languages it's -> Self
You can do this in Python by defining a type variable, T = TypeVar('T'), and using it to annotate self:

def transpose(self: T, *dims: Hashable) -> T:
It is possible, yes, but IMHO it detracts from readability. The question here is, do people actually subclass DataArray and Dataset? I've always found xarray rather hostile to subclassing, chiefly due to the heavy interaction between DataArray and Dataset.
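For reference, a self-contained sketch of the TypeVar pattern being discussed (the class names here are made up for illustration):

```python
from typing import TypeVar

T = TypeVar("T", bound="Grid")

class Grid:
    # Annotating self with a TypeVar makes the return type follow the
    # caller's own class: Grid.copy() -> Grid, DenseGrid.copy() -> DenseGrid.
    def copy(self: T) -> T:
        return type(self)()

class DenseGrid(Grid):
    pass

g = DenseGrid().copy()  # statically typed as DenseGrid, not Grid
```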
elif dim is None:
if dim is None:
    dims = set(self.dims)
elif isinstance(dim, str) or not isinstance(dim, Iterable):
Same here re this vs is_scalar
with raises_regex(ValueError, 'cannot be found'):
    arr.drop(None)
    arr.drop('w')
What happens if None is passed? My quick test suggests this test should still pass, is that right?
None technically works, but it's not a valid use case. hash(None) works and isinstance(None, Hashable) returns True, but mypy will complain if you pass None where a Hashable is required, as None is treated as a special case. Also, xarray uses None everywhere to mark optional parameters, and in this case it asks the user to explicitly provide a value.
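The runtime half of that behaviour can be checked directly (the static-checker half is a mypy-time claim from the comment above, not something observable at runtime):

```python
from collections.abc import Hashable

# None is hashable at runtime: hash(None) succeeds and the ABC check passes.
none_is_hashable = isinstance(None, Hashable)  # True
none_hash_works = hash(None) == hash(None)     # True
```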
# DataArrays as labels are a nasty corner case as they are not
# Iterable[Hashable] - DataArray.__iter__ yields scalar DataArrays.
actual = data.drop(DataArray(['a', 'b', 'c']), 'x', errors='ignore')
Great, it's good this now works
Ready for merge if everybody's happy with the current state

OK, let's merge and we can iterate if needed from here. Thanks as ever @crusaderky @shoyer - do you have any thoughts on
* master:
  pyupgrade one-off run (pydata#3190)
  mfdataset, concat now support the 'join' kwarg. (pydata#3102)
  reduce the size of example dataset in dask docs (pydata#3187)
  add climpred to related-projects (pydata#3188)
  bump rasterio to 1.0.24 in doc building environment (pydata#3186)
  More annotations (pydata#3177)
  Support for __array_function__ implementers (sparse arrays) [WIP] (pydata#3117)
  Internal clean-up of isnull() to avoid relying on pandas (pydata#3132)
  Call darray.compute() in plot() (pydata#3183)
  BUG: fix + test open_mfdataset fails on variable attributes with list… (pydata#3181)
No description provided.