ENH/RFC: Add masking #72

@derb12

Description

I was against adding masking on a package-wide basis back in #41, but I think, if handled carefully, it would actually be a nice addition. This issue will detail how it will be implemented, and why. A tl;dr is that Baseline objects will get a new mask property that allows masking for all weighted methods in pybaselines and will otherwise raise an exception for methods that don't allow weighting if a mask is specified.

I'm open to suggestions/comments.

Specifying a mask

There are two options for how users could specify a mask: (1) put NaNs within their data and expect the mask to be built off of that, or (2) explicitly input a mask:

  1. This is how astropy handles it with their convolution (although they also allow inputting a mask). In the context of baseline correction, the existing data is usually not actually NaN, just a problematic region for calculating the baseline, so requiring users to replace their data with NaN and then call the baseline method seems clunky.

  2. Masks could be potentially specified as (2a) inputs within individual methods or (2b) at the Baseline level.

  • 2a) Adding a mask=None to every single method seems excessive, especially since masking will mostly be internally handled within _Algorithm._register, so most of the methods will not even see the mask. I would concede that this is the more intuitive way to handle masks, but it's not as nice to internally use. Plus, for the methods that already have weights, it would be a bit unclear to the user if both parameters need to be specified or how they interact, which I think is a bigger reason to implement it on the Baseline level.
  • 2b) Add a mask=None to Baseline's initialization. If no mask is given, nothing changes. It could also be added as a Baseline property so that adding/updating/removing a mask is possible, which would be more user-friendly, closer to (2a), than having to create a new Baseline object for each mask; would need to note that updating the mask is not thread safe. I can somewhat justify specifying a single mask for a Baseline object when considering that, many times, you'd want to mask out a region from a problematic detector, which would in theory affect all data in the same way.
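A rough sketch of what option (2b) could look like; the class here is a stripped-down stand-in, not pybaselines' actual Baseline implementation, and the property-based validation is only one possible design:

```python
import numpy as np


class Baseline:
    """Minimal stand-in sketching a mask property for option (2b)."""

    def __init__(self, x_data=None, mask=None):
        self.x_data = x_data
        self.mask = mask  # routed through the property setter below

    @property
    def mask(self):
        return self._mask

    @mask.setter
    def mask(self, value):
        # None removes the mask; otherwise validate and store as boolean
        if value is None:
            self._mask = None
        else:
            self._mask = np.asarray(value, dtype=bool)


masked = Baseline(mask=[0, 1, 0])   # mask coerced to a boolean array
unmasked = Baseline()               # no mask -> nothing changes
masked.mask = None                  # updating/removing the mask in place,
                                    # without creating a new Baseline object
```

The setter is where a shape check against x_data could also live, so an invalid mask fails at assignment rather than deep inside a baseline method.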

The main issue with implementing (2) is what should mask values mean? True/1 values could mean mask this out, which is how NumPy's MaskedArray works, as well as how astropy and other libraries that deal with masking like marray handle masks. It feels slightly counterintuitive to expectations when considering the mask as weights where 1/True means "fit this" and 0/False means "ignore this", but being consistent with the larger community would be better in the long run. Really just needs to be documented clearly, where mask creation could be related to the output of np.isnan. Internally, I'd need both the mask and its inverse, so it's just a matter of semantics of what form users give.

The other issue I see with (2b) is that methods that allow batched calculations (I think currently just collab_pls) will only be able to use one mask across all the datasets. It's frankly such a niche use case that I'm fine with that limitation... Besides, at least for collab_pls, it wouldn't make sense to try to apply more than one mask since all the datasets are supposed to have the same baseline.

I think (2b) is the best option to use within pybaselines. I'd rather keep masking support fairly simple, so no additional nan_policy or nan_treatment values; if a mask is given, it will be used to provide a valid weighted interpolation when possible, as detailed below; if the user prefers those values be NaN after baseline correction, that can be handled by them using the output. This also simplifies usage internally for optimizer-based methods.

Applying masks

I detailed how the different baseline algorithms handle masking in issue #41, but just to reiterate it here: all methods in pybaselines roughly fall into 3 categories with how they handle masks:

A) Methods that can directly use the mask as weights (or np.logical_not(mask), depending on what mask values mean as stated in (2) above), making them easy to integrate with the new masking support. Only non-iteratively-reweighted polynomials (i.e. all but loess and quant_reg) are in this category. Some classifier methods fall into this category as well (I was wrong about this in the original issue), like maybe fabc, but in a harder-to-handle way.
B) Methods that use iterative reweighting such that their weight functions need to be made mask-aware. Includes all Whittaker and P-spline methods, quant_reg and loess. Actually much easier to implement masking than I originally thought, mostly just need to wrap all the weighting functions with something like the decorator below and then masking's supported (more or less; things like loess will need more careful handling...).

weighting decorator
from functools import wraps

import numpy as np

def masked_weighting(weighting_func):

    @wraps(weighting_func)
    def inner(y, baseline, *, fit_mask=None, **kwargs):
        no_mask = fit_mask is None
        if no_mask:
            input_y = y
            input_baseline = baseline
        else:
            input_y = y[fit_mask]
            input_baseline = baseline[fit_mask]

        output = weighting_func(input_y, input_baseline, **kwargs)
        if no_mask:
            full_output = output
        else:
            if isinstance(output, tuple):
                output_weights, *other = output
            else:
                output_weights = output
                other = ()
            full_weights = np.zeros(y.shape)
            full_weights[fit_mask] = output_weights
            if other:
                full_output = (full_weights, *other)
            else:
                full_output = full_weights

        return full_output

    return inner


# then implemented as such within _weighting.py
@masked_weighting
def _asls(y, baseline, p):
    ...

# and called like so, e.g. within the asls method
new_weights = _weighting._asls(y, baseline, p=p, fit_mask=~self.mask)  # or self.mask depending on mask values
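For category (A), by contrast, no decorator is needed: the inverted mask can serve directly as the weights of the fit, so masked points simply get zero weight. A self-contained sketch using a plain NumPy weighted polynomial fit to stand in for a pybaselines polynomial method:

```python
import numpy as np

x = np.linspace(0, 1, 20)
y = 2 * x + 1  # true underlying line
y_corrupt = y.copy()
y_corrupt[8:12] = 100.0  # problematic detector region

mask = np.zeros_like(x, dtype=bool)
mask[8:12] = True  # True == "mask this out"

# category (A): the inverted mask acts directly as the fit weights,
# so the corrupted region contributes nothing to the least squares fit
weights = (~mask).astype(float)
coeffs = np.polynomial.polynomial.polyfit(x, y_corrupt, deg=1, w=weights)
```

In pybaselines itself, the same weights array would be passed as the existing `weights` parameter of the polynomial methods, which is why this category needs essentially no new machinery.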

C) All others, which don't have an explicit way to integrate masking (in a mathematically valid sense). I'd prefer to just raise an exception for them since doing something like interpolation of the input data would be a bit misleading to do internally. In the masking example in the docs, I discuss how to use interpolation before doing baseline correction for these types of methods, so I think users can just do that if they really want to use masking for these; the key is that it's an explicit choice on the user's end rather than a hidden implementation detail within pybaselines. Technically, I could use astropy's convolution to properly interpolate values in methods that use convolution, and maybe use bottleneck's various nan-aware functions for methods that use rolling windows, but it's just beyond what I want to support, at least right now. It would maybe make a nice new example, though, to show how to implement one of the simpler baseline algorithms in this category to correctly incorporate a mask.
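The user-side workaround for category (C) could look something like the following: plain NumPy interpolation over the masked region before calling any baseline method (the baseline call itself is omitted here, since the point is that the fill happens explicitly on the user's end):

```python
import numpy as np

x = np.linspace(0, 10, 50)
y = np.exp(-((x - 5) ** 2))  # example signal
mask = (x > 3) & (x < 4)  # True == problematic region to exclude

# explicitly interpolate across the masked region before baseline correction
y_filled = y.copy()
y_filled[mask] = np.interp(x[mask], x[~mask], y[~mask])

# y_filled can then be passed to any category (C) method; interpolating
# was the user's explicit choice, not a hidden step inside pybaselines
```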

I like the error-by-default mask handling, since it removes the pressure from needing to ensure mask support for new algorithms, which would otherwise be a fairly high self-imposed barrier. This way, I can gradually add support, if possible, at my own pace.

Technically optimizers are a fourth category, but they should mostly be able to just ignore masks.
