ENH/RFC: Add masking #72

@derb12

Description

I was against adding masking on a package-wide basis back in #41, but I think, if handled carefully, it would actually be a nice addition. This issue will detail how it will be implemented, and why. A tl;dr is that Baseline objects will get a new mask property that allows masking for all weighted methods in pybaselines and will otherwise raise an exception for methods that don't allow weighting if a mask is specified.

I'm open to suggestions/comments.

Specifying a mask

There are two options for how users could specify a mask: (1) put NaNs within their data and expect the mask to be built off of that, or (2) explicitly input a mask:

  1. This is how astropy handles it with their convolution (although they also allow inputting a mask). In the context of baseline correction, the existing data is usually not actually NaN, just a problematic region for calculating the baseline, so requiring users to replace their data with NaN and then call the baseline method seems clunky.

  2. Masks could be potentially specified as (2a) inputs within individual methods or (2b) at the Baseline level.

  • 2a) Adding a mask=None to every single method seems excessive, especially since masking will mostly be internally handled within _Algorithm._register, so most of the methods will not even see the mask. I would concede that this is the more intuitive way to handle masks, but it's not as nice to internally use. Plus, for the methods that already have weights, it would be a bit unclear to the user if both parameters need to be specified or how they interact, which I think is a bigger reason to implement it on the Baseline level.
  • 2b) Add a mask=None to Baseline's initialization. If no mask is given, nothing changes. It could also be added as a Baseline property so that adding/updating/removing a mask is possible, which would be more user-friendly, closer to (2a), than having to create a new Baseline object for each mask; would need to note that updating the mask is not thread safe. I can somewhat justify specifying a single mask for a Baseline object when considering that, many times, you'd want to mask out a region from a problematic detector, which would in theory affect all data in the same way.
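A rough sketch of what option (2b) could look like; the class here is a stripped-down stand-in, not pybaselines' actual Baseline implementation, and the property-based validation is only one possible design:

```python
import numpy as np


class Baseline:
    """Minimal stand-in sketching a mask property for option (2b)."""

    def __init__(self, x_data=None, mask=None):
        self.x_data = x_data
        self.mask = mask  # routed through the property setter below

    @property
    def mask(self):
        return self._mask

    @mask.setter
    def mask(self, value):
        # None removes the mask; otherwise validate and store as boolean
        if value is None:
            self._mask = None
        else:
            self._mask = np.asarray(value, dtype=bool)


masked = Baseline(mask=[0, 1, 0])   # mask coerced to a boolean array
unmasked = Baseline()               # no mask -> nothing changes
masked.mask = None                  # updating/removing the mask in place,
                                    # without creating a new Baseline object
```

The setter is where a shape check against x_data could also live, so an invalid mask fails at assignment rather than deep inside a baseline method.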

The main issue with implementing (2) is what should mask values mean? True/1 values could mean mask this out, which is how NumPy's MaskedArray works, as well as how astropy and other libraries that deal with masking like marray handle masks. It feels slightly counterintuitive to expectations when considering the mask as weights where 1/True means "fit this" and 0/False means "ignore this", but being consistent with the larger community would be better in the long run. Really just needs to be documented clearly, where mask creation could be related to the output of np.isnan. Internally, I'd need both the mask and its inverse, so it's just a matter of semantics of what form users give.

The other issue I see with (2b) is that methods that allow batched calculations (I think currently just collab_pls) will only be able to use one mask across all the datasets. It's frankly such a niche use case that I'm fine with that limitation... Besides, at least for collab_pls, it wouldn't make sense to try to apply more than one mask since all the datasets are supposed to have the same baseline.

I think (2b) is the best option to use within pybaselines. I'd rather keep masking support fairly simple, so no additional nan_policy or nan_treatment values; if a mask is given, it will be used to provide a valid weighted interpolation when possible, as detailed below; if the user prefers those values be NaN after baseline correction, that can be handled by them using the output. This also simplifies usage internally for optimizer-based methods.

Applying masks

I detailed how the different baseline algorithms handle masking in issue #41, but just to reiterate it here: all methods in pybaselines roughly fall into 3 categories with how they handle masks:

A) Methods that can directly use the mask as weights (or np.logical_not(mask), depending on what mask values mean as stated in (2) above), making them easy to integrate with the new masking support. Only non-iteratively-reweighted polynomials (i.e. all but loess and quant_reg) are in this category. Some classifier methods fall into this category as well (I was wrong about this in the original issue), like maybe fabc, but in a harder-to-handle way.
B) Methods that use iterative reweighting such that their weight functions need to be made mask-aware. Includes all Whittaker and P-spline methods, quant_reg and loess. Actually much easier to implement masking than I originally thought, mostly just need to wrap all the weighting functions with something like the decorator below and then masking's supported (more or less; things like loess will need more careful handling...).

weighting decorator
from functools import wraps

import numpy as np

def masked_weighting(weighting_func):

    @wraps(weighting_func)
    def inner(y, baseline, *, fit_mask=None, **kwargs):
        no_mask = fit_mask is None
        if no_mask:
            input_y = y
            input_baseline = baseline
        else:
            input_y = y[fit_mask]
            input_baseline = baseline[fit_mask]

        output = weighting_func(input_y, input_baseline, **kwargs)
        if no_mask:
            full_output = output
        else:
            if isinstance(output, tuple):
                output_weights, *other = output
            else:
                output_weights = output
                other = ()
            full_weights = np.zeros(y.shape)
            full_weights[fit_mask] = output_weights
            if other:
                full_output = (full_weights, *other)
            else:
                full_output = full_weights

        return full_output

    return inner


# then implemented as such within _weighting.py
@masked_weighting
def _asls(y, baseline, p):
    ...

# and called like so, e.g. within the asls method
new_weights = _weighting._asls(y, baseline, p=p, fit_mask=~self.mask)  # or self.mask depending on mask values
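For category (A), by contrast, no decorator is needed: the inverted mask can serve directly as the weights of the fit, so masked points simply get zero weight. A self-contained sketch using a plain NumPy weighted polynomial fit to stand in for a pybaselines polynomial method:

```python
import numpy as np

x = np.linspace(0, 1, 20)
y = 2 * x + 1  # true underlying line
y_corrupt = y.copy()
y_corrupt[8:12] = 100.0  # problematic detector region

mask = np.zeros_like(x, dtype=bool)
mask[8:12] = True  # True == "mask this out"

# category (A): the inverted mask acts directly as the fit weights,
# so the corrupted region contributes nothing to the least squares fit
weights = (~mask).astype(float)
coeffs = np.polynomial.polynomial.polyfit(x, y_corrupt, deg=1, w=weights)
```

In pybaselines itself, the same weights array would be passed as the existing `weights` parameter of the polynomial methods, which is why this category needs essentially no new machinery.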

C) All others, which don't have an explicit way to integrate masking (in a mathematically valid sense). I'd prefer to just raise an exception for them since doing something like interpolation of the input data would be a bit misleading to do internally. In the masking example in the docs, I discuss how to use interpolation before doing baseline correction for these types of methods, so I think users can just do that if they really want to use masking for these; the key is that it's an explicit choice on the user's end rather than a hidden implementation detail within pybaselines. Technically, I could use astropy's convolution to properly interpolate values in methods that use convolution, and maybe use bottleneck's various nan-aware functions for methods that use rolling windows, but it's just beyond what I want to support, at least right now. It would maybe make a nice new example, though, to show how to implement one of the simpler baseline algorithms in this category to correctly incorporate a mask.
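The user-side workaround for category (C) could look something like the following: plain NumPy interpolation over the masked region before calling any baseline method (the baseline call itself is omitted here, since the point is that the fill happens explicitly on the user's end):

```python
import numpy as np

x = np.linspace(0, 10, 50)
y = np.exp(-((x - 5) ** 2))  # example signal
mask = (x > 3) & (x < 4)  # True == problematic region to exclude

# explicitly interpolate across the masked region before baseline correction
y_filled = y.copy()
y_filled[mask] = np.interp(x[mask], x[~mask], y[~mask])

# y_filled can then be passed to any category (C) method; interpolating
# was the user's explicit choice, not a hidden step inside pybaselines
```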

I like the error-by-default mask handling, since it removes the pressure from needing to ensure mask support for new algorithms, which would otherwise be a fairly high self-imposed barrier. This way, I can gradually add support, if possible, at my own pace.

Technically optimizers are a fourth category, but they should mostly be able to just ignore masks.
